2 Introduction to parallel computing, Chip Multiprocessors (ACS MPhil), Robert Mullins (PowerPoint PPT presentation)


SLIDE 1

2. Introduction to parallel computing

Chip Multiprocessors (ACS MPhil) Robert Mullins

SLIDE 2

Overview

  • Parallel computing platforms

– Approaches to building parallel computers
– Today's chip-multiprocessor architectures

  • Approaches to parallel programming

– Programming with threads and shared memory
– Message-passing libraries
– PGAS languages
– High-level parallel languages

SLIDE 3

Parallel computers

  • How might we exploit multiple processing elements and memories in order to complete a large computation quickly?

– How many processing elements, and how powerful?
– How do they communicate and cooperate?

  • How are memories and processing elements interconnected?
  • How is the memory hierarchy organised?

– How might we program such a machine?

SLIDE 4

The control structure

  • How are the processing elements controlled?

– Centrally, from a single control unit, or can they work independently?

  • Flynn's taxonomy:
  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Multiple Data (MIMD)
SLIDE 5

The control structure

  • SIMD

– The scalar pipelines execute in lockstep
– Data-independent logic is shared

  • Efficient for highly data-parallel applications
  • Much simpler instruction fetch and supply mechanism

– SIMD hardware can support an SPMD model if the individual threads follow similar control flow

  • Masked execution

A Generic Streaming Multiprocessor (for graphics applications)

Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.

SLIDE 6

The communication model

  • A clear distinction is made between two common communication models:

– 1. Shared-address-space platforms

  • All processors have access to a shared data space accessed via a shared address space
  • All communication takes place via a shared memory
  • Each processing element may also have an area of memory that is private

SLIDE 7

The communication model

  • 2. Message-passing platforms

– Each processing element has its own exclusive address space
– Communication is achieved by sending explicit messages between processing elements
– The sending and receiving of messages can be used both to communicate between and to synchronize the actions of multiple processing elements

SLIDE 8

Multi-core

Figure courtesy of Tim Harris, MSR

SLIDE 9

SMP multiprocessor

Figure courtesy of Tim Harris, MSR

SLIDE 10

NUMA multiprocessor

Figure courtesy of Tim Harris, MSR

SLIDE 11

Message-passing platforms

  • Many early message-passing machines provided hardware primitives that were close to the send/receive user-level communication commands

– e.g. a pair of processors may be interconnected with a hardware FIFO queue
– The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)

[Figure: binary hypercube with nodes labelled 000-111; Culler, Figure 1.22]

SLIDE 12

Message-passing platforms

  • The Transputer (1984)

– The result of an earlier foray into the world of parallel computing!
– The Transputer contained integrated serial links for building multiprocessors

  • IN/OUT instructions in the ISA for sending and receiving messages

– Programmed in OCCAM (based on CSP)

  • IBM Victor V256 (1991)

– 16x16 array of transputers
– The processors could be partitioned dynamically between different users

SLIDE 13

Message-passing platforms

  • Recently some chip-multiprocessors have taken a similar approach (RAW/Tilera and XMOS)

– Message queues (or communication channels) may be register mapped or accessed via special instructions
– The processor stalls when reading an empty input queue or when trying to write to a full output buffer

A wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped.

(See also the iWarp paper on wiki)

SLIDE 14

Message-passing platforms

  • For larger message-passing machines (typically scientific supercomputers), direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a more general communication assist processor)

– The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
– No restrictions on the programmer or extra software support required

  • Hardware and software evolution meant there was a general convergence of parallel machine organisations
SLIDE 15

Message-passing platforms

  • The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations

– Here data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv*
– Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped

*Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines.

SLIDE 16

One-sided communication

  • SHMEM

– Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:

  • shmem_put(target_addr, source_addr, length, remote_pe)
  • shmem_get, shmem_barrier, etc.

– One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement

SLIDE 17

The communication model

  • From a hardware perspective we would like to keep the machine simple (message-passing)
  • But we inevitably need to simplify the programmer's and compiler's task

– Efficiently support shared-memory programming
– Add support for transactional memory?
– Create a simple but high-performance target

  • There are trade-offs between hardware complexity and the complexity of the software and compiler.

SLIDE 18

Today's chip multiprocessors

  • Intel Nehalem-EX (2009)

– 8 cores

  • 2-way hyperthreaded (SMT)
  • 16 hardware threads

– L1I 32KB, L1D 32KB
– 256KB L2 (private)
– 24MB L3 (shared)

  • 8 banks
  • Inclusive L3
SLIDE 19

Today's chip multiprocessors

[Figure: cache hierarchy (L1, L2, shared L3, memory), Intel Nehalem-EX (2009)]

SLIDE 20

Today's chip multiprocessors

  • IBM Power 7 (2010)

– 8 cores (dual-chip module to hold 16 cores)
– 32MB shared eDRAM L3 cache
– 2-channel DDR3 controllers
– Individual cores:

  • 4-thread SMT per core
  • 6 ops/cycle
  • 4GHz
SLIDE 21

Today's chip multiprocessors

IBM Power 7 (2010)

SLIDE 22

Today's chip multiprocessors

  • Sun Niagara T1 (2005)

Each core has its own level 1 cache (16KB for instructions, 8KB for data). The level 2 caches are 3MB in total and are effectively 12-way associative. They are interleaved by 64-byte cache lines.

SLIDE 23

Oracle M7 Processor (2014)

  • 32 cores

– Dual-issue, OOO

  • Dynamic multithreading, 1-8 threads/core
  • 256KB I & D L2 caches shared by groups of 4 cores
  • 64MB L3
  • Technology: 20nm, 13 metal layers
  • 16 DDR channels

– 160GB/s
– (vs. ~20GB/s for T1)

  • >10B transistors!
SLIDE 24

“Manycore” designs: Tilera

  • Tilera (now Mellanox)

– Evolution of MIT RAW
– 100 cores: a grid of identical tiles
– Low-power 3-way VLIW cores
– Cores interconnected by a selection of static and dynamic on-chip networks

SLIDE 25

“Manycore” designs: Celerity (2017)

Tiered Accelerator Fabric:

– General-purpose tier: 5 "Rocket" RISC-V cores
– Massively parallel tier: 496 5-stage RISC-V cores in a 16x31 tiled mesh array
– Specialised tier: Binarized Neural Network accelerator

SLIDE 26

GPUs

"The NVIDIA GeForce 8800 GPU", Hot Chips 2007

  • TESLA P100

– 56 streaming multiprocessors x 64 cores = 3584 "cores" or lanes
– 732GB/s memory bandwidth
– 4MB L2 cache
– 15.3 billion transistors

SLIDE 27

Communication latencies

  • Chip multiprocessor

– Some have very fast core-to-core communication, as low as 1-3 cycles
– Opportunities to add dedicated core-to-core links
– Typical L1-to-L1 communication latencies may be around 10-100 cycles

  • Other types of parallel machine:

– Shared-memory multiprocessor: ~500 cycles
– Cluster/supercomputer: ~5000-10000 cycles

SLIDE 28

Approaches to parallel programming

  • "Principles of Parallel Programming", Calvin Lin and Lawrence Snyder, Pearson, 2009
  • This book provides a good overview of the different approaches to parallel programming
  • There is also a significant amount of information on the course wiki

– Try some examples!

SLIDE 29

Approaches to parallel programming

  • Programming with threads and shared memory
  • Message-passing libraries
  • PGAS languages
  • High level parallel languages
SLIDE 30

Threads and shared memory

  • A thread, or thread of execution, is a unit of parallelism

– It consists of everything necessary to execute a sequential stream of instructions

  • program code, a call stack, a set of registers (incl. a single program counter)

– It shares memory with other threads

  • Threads cooperate and coordinate their actions by reading and writing to shared variables

– Special atomic operations are provided by the multiprocessor for synchronization

SLIDE 31

Threads and shared memory

  • How might we express threads in our code?
  • fork/join

– fork/join keywords can appear anywhere in code
– General, but unstructured

p1
fork(p5)    ; start p5 in parallel
p2
fork(p3)
p4
join(p5)    ; wait for p5 to complete
p6
join(p3)
p7

A forked procedure runs in parallel with the main thread

SLIDE 32

Threads and shared memory

  • fork/join using the pthreads library

– Limitations to bare-metal thread programming?

void *thread_func(void *ptr) {
    int i = ((thread_args *) ptr)->input;
    ((thread_args *) ptr)->output = fib(i);
    return NULL;
}

args.input = n - 1;

// create and start first thread
status = pthread_create(&thread, NULL, thread_func, (void *) &args);

// calc. fib(n-2) in parallel
result = fib(n - 2);

// join
pthread_join(thread, NULL);

SLIDE 33

Threads and shared memory

  • parbegin/parend (cobegin/coend)
  • Simple and structured, but not as general as fork/join, e.g. we cannot represent the graph on the previous slide.

p1
parbegin
    p5
    begin
        p2
        parbegin p3 p4 parend
    end
parend
p6
p7

SLIDE 34

Threads and shared memory

  • Even though parbegin..parend can only represent properly nested dependency graphs, it is usually adequate
  • Cilk-style spawn/sync

cilk int fib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

spawn – indicates that the procedure call can safely proceed in parallel
sync – wait until all previously spawned procedures have returned their results

SLIDE 35

Threads and shared memory

  • forall (doall, parfor)

– Simply allows a programmer to indicate that each iteration of the loop is independent and may be run in parallel
– OpenMP example:

#pragma omp parallel for
for (i = first; i < n; i += prime)
    marked[i] = 1;

SLIDE 36

Threads and shared memory

  • Futures

– future <expr>

  • Evaluate the expression concurrently with the calling program: an asynchronous function call
  • If a thread requires the value of a future that has not yet been computed, stall the thread until it is available

y = future(fn(x))
.....
.....
z = y + 1;

"The incremental garbage collection of processes", Baker/Hewitt, 1977

SLIDE 37

Threads and shared memory

  • Synchronization and coordination

– In addition to creating threads, we also need to be able to control the way threads interact
– This often involves identifying critical sections

  • Mechanisms

– Locks and barriers
– Mutexes and monitors
– Condition variables (wait/signal)
– Transactional memory

  • See reading group papers and examples
SLIDE 38

Message-passing

  • Simple (perhaps primitive) programming model

– Programmer must distribute and explicitly move data
– The fact that the interactions are explicit can be seen as both an advantage and a disadvantage

  • Potentially simple hardware implementation
  • Processes communicate and synchronize by sending messages

– Message Passing Interface (MPI) standard

  • Widely used on High-Performance Computing (HPC) platforms
  • Programs tend to be portable
  • Usually written in a Single-Program Multiple-Data (SPMD) style

SLIDE 39

PGAS languages

  • Partitioned Global Address Space languages

– Aimed at large-scale distributed-memory machines

  • Aim to improve on MPI

– PGAS languages overlay a global address space on the virtual memories of the distributed machines

  • No expectation that memories will be coherent
  • The programmer distinguishes between local and non-local data
  • The compiler generates the necessary communication calls in response to non-local references
  • The compiler exploits one-sided communication primitives rather than message-passing
  • Co-Array Fortran, Unified Parallel C, Titanium (Ti) (Titanium extends Java)

SLIDE 40

High-level parallel languages

  • Global view of computation

– Raise the level of abstraction

  • Hide low-level details of communication and synchronization
  • Take a global view and describe the algorithm rather than per-task behavior
  • e.g. ZPL forces the programmer to think in a parallel style using array operations (reference to neighboring elements, flood, remap, reduction, ...)
  • Compiler, runtime and libraries will manage implementation details

– Interesting examples:

  • ZPL – array programming language
  • NESL, Data Parallel Haskell (see wiki)
  • See also Cray Chapel, IBM X10, Sun Fortress languages (DARPA HPCS project)