SLIDE 1

Parallel Computer Architecture

Lars Karlsson

Umeå University

2009-12-07

SLIDE 2

Topics Covered

Multicore processors
Short vector instructions (SIMD)
Advanced instruction level parallelism
Cache coherence
Hardware multithreading
Sample multicore processors
Introduction to parallel programming
SLIDE 3

Part I Introduction

SLIDE 4

Moore’s Law

Moore’s law predicts an exponential growth in the number of transistors per chip
Observed exponential growth over the last couple of decades
Appears to continue for at least another decade
Enables the construction of faster processors
SLIDE 5

Turning Transistors into Performance

The old approach

Speed up a single instruction stream:

◮ Increase the clock frequency
◮ Pipeline the execution of instructions
◮ Predict branches to reduce the overhead of pipeline stalls
◮ Issue several instructions per clock
◮ Schedule instructions out-of-order
◮ Use short vector instructions (SIMD)
◮ Hide memory latency with a multilevel cache hierarchy

Conclusion: relies on Instruction Level Parallelism (ILP)
SLIDE 6

Limits of the Old Approach

The Power Wall

◮ Power consumption depends linearly on the clock frequency
◮ Power leads to heat
◮ Power is expensive
◮ Frequency around 2–3 GHz since 2001
◮ Prior to 2001: exponential growth over several decades

The ILP Wall

◮ Already, few applications utilize all functional units
◮ Sublinear return on invested resources (transistors/power)
◮ Diminishing returns
SLIDE 7

Turning Transistors into Performance

The new approach: multicore architectures

Several cores on one die – increases peak performance
Reduce the clock frequency – saves power
Use a simpler core design – frees transistors

Which of the following choices leads to the highest performance?

◮ All cores identical: homogeneous multicore
◮ Different types of cores: heterogeneous multicore

Clearly, heterogeneous multicores are potentially harder to program.
SLIDE 8

Heterogeneous Multicores

A Simple Model for Building Heterogeneous Multicores

Consider the following core designs:

◮ Small: 1 unit of area, 1 unit of performance
◮ Medium: 4 units of area, 2 units of performance
◮ Large: 16 units of area, 4 units of performance

Suppose we have 16 units of die area. Consider these processors:

Large: 1 large core

◮ 4 units of sequential performance
◮ 4 units of parallel performance

Medium/Homo: 4 medium cores

◮ 2 units of sequential performance
◮ 8 units of parallel performance

Small/Homo: 16 small cores

◮ 1 unit of sequential performance
◮ 16 units of parallel performance

Hetero: 1 medium and 12 small cores

◮ 2 units of sequential performance
◮ 14 units of parallel performance
SLIDE 9

Heterogeneous Multicores

Evaluating Design Choices

Partition an algorithm’s execution time:

Serial fraction f ∈ [0, 1]: no parallel speedup possible

◮ f ≈ 1: sequential algorithm (very rare)
◮ f ≈ 0: perfectly parallel algorithm (quite common)

Parallel fraction (1 − f): perfect parallel speedup

Performance as a function of f:

[Figure: performance as a function of the serial fraction f for the Large, Medium/Homo, Small/Homo, and Hetero designs; the curves span 1 to 16 units of performance]
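A minimal sketch (my own, not from the slides) of the performance model behind this plot, assuming the serial fraction runs at a design's sequential performance and the rest at its parallel performance (an Amdahl-style model). The function and parameters below simply reuse the numbers from the previous slide.

#include <stdio.h>

/* Performance of a design with sequential performance seq and parallel
   performance par on a workload with serial fraction f. */
static double perf(double f, double seq, double par)
{
    return 1.0 / (f / seq + (1.0 - f) / par);
}

int main(void)
{
    for (int i = 0; i <= 10; i++) {
        double f = i / 10.0;
        printf("f=%.1f  Large=%5.2f  Medium/Homo=%5.2f  Small/Homo=%5.2f  Hetero=%5.2f\n",
               f,
               perf(f, 4.0, 4.0),    /* 1 large core              */
               perf(f, 2.0, 8.0),    /* 4 medium cores            */
               perf(f, 1.0, 16.0),   /* 16 small cores            */
               perf(f, 2.0, 14.0));  /* 1 medium + 12 small cores */
    }
    return 0;
}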

SLIDE 10

Memory System

Machine characteristics:

Peak computational performance
Memory bandwidth
Memory latency

The first two impose hardware limits on performance.

Compute-bound, e.g., most of dense linear algebra
Memory-bound, e.g., most of sparse linear algebra
Latency-bound, e.g., finite state machines
SLIDE 11

Memory System

Compute-Bound vs Memory-Bound

Sample difference in performance between a compute-bound and a memory-bound algorithm on Akka @ HPC2N

[Figure: performance (Gflop/s) versus matrix size (200–2000) on Akka for a compute-bound and a memory-bound algorithm; the annotations 43x and 4x mark the performance gap between the two curves]
SLIDE 12

Obtaining Peak Floating Point Performance

To obtain peak performance, an algorithm must:

Have a high arithmetic intensity
Exploit the ISA effectively
Parallelize over all cores

Exploiting the ISA effectively means:

Balancing the number of multiplies with adds

◮ Fused multiply and add (FMA)
◮ Adder and multiplier in parallel

Using SIMD instructions
Having enough instruction level parallelism (ILP)
Having a predictable control flow (see the sketch below)
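Two small loops (my own illustration, not from the slides) that exhibit the properties listed above: a balanced multiply/add mix that maps onto FMA units, and independent accumulators that expose ILP. Whether the compiler actually emits FMA and SIMD instructions depends on the target and flags (an assumption, e.g. -O3 -mfma with GCC).

/* One multiply and one add per element: a balanced mix that maps naturally
   onto fused multiply-add (FMA) units, with predictable control flow. */
void axpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* Four independent partial sums expose instruction level parallelism: the
   updates of s0..s3 do not depend on each other and can overlap. */
float dot(int n, const float *x, const float *y)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}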

SLIDE 13

Part II SISD / MIMD / SIMD

SLIDE 14

Flynn’s Taxonomy

Flynn’s taxonomy classifies parallel computers based on:

Number of instruction streams
Number of data streams

Instr. \ Data    Single   Multi
Single           SISD     SIMD
Multi            MISD     MIMD

SISD: Uniprocessor
MIMD: Multicores/clusters
SIMD: Vector processors/instructions
MISD: ???

SLIDE 15

Single Instruction Multiple Data (SIMD)

Several ALUs operating synchronously in parallel

◮ Same instruction stream
◮ Different data streams

Several variants:

◮ SIMD/Vector instructions
◮ Different control flows
SLIDE 16

SIMD Programming

SSE example (Intel C intrinsics):

__m128 vecA, vecB, vecC;
vecC = _mm_add_ps(vecA, vecB);

Vector data types
Vector operations

[Figure: 4-vector addition c = a + b; one issue logic drives four ALUs that add the four lanes in parallel]
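A slightly more complete sketch of how these intrinsics are typically used (my own, not on the slide), adding the loads and stores around the addition. It assumes <xmmintrin.h>, that n is a multiple of 4, and that the arrays are 16-byte aligned.

#include <xmmintrin.h>

/* c = a + b, four floats at a time with SSE. */
void vec_add(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vecA = _mm_load_ps(&a[i]);     /* load 4 floats           */
        __m128 vecB = _mm_load_ps(&b[i]);
        __m128 vecC = _mm_add_ps(vecA, vecB); /* 4 additions in parallel */
        _mm_store_ps(&c[i], vecC);            /* store 4 floats          */
    }
}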

SLIDE 17

MIMD: Shared vs Distributed Address Space

A key issue in MIMD design is whether or not to support a shared address space programming model (shared memory) in hardware; the alternative is distributed memory.

Distributed memory

◮ Each process has its own address space
◮ Explicit communication (message passing)

Shared memory

◮ Each process shares a global address space
◮ Implicit communication (reads/writes + synchronization)

Supporting shared memory in hardware leads to various issues:

What if two threads access the same memory location?
How to manage multiple cached copies of a memory location?
How to synchronize the threads?

Supporting distributed memory is much simpler.
SLIDE 18

MIMD: Synchronization

Thread cooperation requires that some threads write data that other threads read. To avoid corrupted results, the threads must be synchronized so that data races are avoided.

Definition (Data Race)

A data race occurs when two threads access the same memory location, at least one of the accesses is a write, and the accesses are not ordered by synchronization. With data races present, the output depends on the execution order. Without any data races, the program is correctly synchronized.
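A minimal PThreads illustration of a data race (my own example, not on the slide): two threads increment a shared counter without synchronization, so increments can be lost and the output depends on the execution order.

#include <pthread.h>
#include <stdio.h>

int counter = 0;   /* shared and unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;   /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   /* often less than 200000 */
    return 0;
}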

SLIDE 19

MIMD: Synchronization

Hardware Support

Atomic read/write instructions are not strong enough

◮ Synchronization primitives too expensive to implement
◮ The cost grows with the number of processors

Atomic read-modify-write required

◮ Atomic exchange
◮ Fetch-and-increment
◮ Test-and-set
◮ Compare-and-swap
◮ Load linked – store conditional
SLIDE 20

MIMD: Synchronization

Implementing a Lock with Atomic Exchange

Represent the state of a lock (locked/free) by an integer

◮ 0: free
◮ 1: locked

Locking:

◮ Atomically exchange the lock variable with 1
◮ If the exchange returns 0: the lock was free and is now locked – OK!
◮ If the exchange returns 1: the lock was already locked and still is – retry!

Precisely one thread will succeed, since the exchanges are ordered by the hardware.

Unlocking:

◮ Overwrite the lock variable with 0
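A minimal spinlock sketch along these lines (my own, not from the slides), using the GCC/Clang builtin __sync_lock_test_and_set as the atomic exchange; the type and function names are mine.

typedef volatile int spinlock_t;   /* 0: free, 1: locked */

static void spin_lock(spinlock_t *lock)
{
    /* Atomically exchange the lock variable with 1; retry while the
       previous value was 1, i.e. while another thread holds the lock. */
    while (__sync_lock_test_and_set(lock, 1) == 1)
        ;   /* spin */
}

static void spin_unlock(spinlock_t *lock)
{
    /* Overwrite the lock variable with 0 (with release semantics). */
    __sync_lock_release(lock);
}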

SLIDE 21

Compiling for SIMD/Shared Memory/Distributed Memory

Compiling for SIMD instructions

◮ Alignment
◮ Data structures

Compiling for shared memory

◮ Loop-level parallelism
◮ Best strategy depends on the usage pattern
◮ Speculative multithreading

Compiling for distributed memory

◮ Data distribution
◮ Communication

Summary: it is very difficult to compile for parallel architectures. Programmers are responsible for almost all parallelization.
SLIDE 22

Part III Advanced Instruction Level Parallelism

SLIDE 23

Multiple Issue

Pipelining is the basic tool to exploit ILP
Multiple issue basically replicates the pipeline

◮ Several instructions issued per clock
◮ Allows a CPI of less than one

Example:

◮ 4 GHz clock
◮ 4-way multiple issue
◮ 5-stage pipeline
◮ 4 × 5 = 20 instructions executing in parallel
◮ 4 × 4 = 16 billion instructions per second
◮ CPI of 0.25
SLIDE 24

Multiple Issue Responsibilities (Hardware vs Compiler)

Packaging instructions into issue slots

◮ (Deciding which instructions to issue each clock cycle)
◮ Static multiple issue: compiler at least partially responsible
◮ Dynamic multiple issue: processor responsible but the compiler helps

Dealing with data and control hazards

◮ Static multiple issue: some responsibility on the compiler
◮ Dynamic multiple issue: hardware alleviates some hazards
SLIDE 25

Speculation

To enable more ILP, a processor may speculate on the properties of instructions

◮ Speculate on the outcome of a branch
  ⋆ Enables instructions after the branch to begin execution
◮ Speculate that a load following a store refers to a distinct address
  ⋆ Enables executing the load before the store

Roll-back is required when the speculation was wrong
Speculated results are buffered until the speculation outcome is known
SLIDE 26

Static Multiple Issue

Instructions grouped into issue packets

◮ Fixed number of instructions per packet
◮ Restrictions on the mix of instructions

Synonym: Very Long Instruction Word (VLIW)
Compiler responsible for grouping and scheduling instructions
SLIDE 27

Dynamic Multiple Issue

Synonym: superscalar processor
Hardware decides which instructions to issue each clock

◮ In-order: instructions are issued in program order
◮ Out-of-order: (limited amount of) hardware lookahead
  ⋆ Synonym: dynamic pipeline scheduling
SLIDE 28

Part IV Cache Coherence

SLIDE 29

Cache Coherence

A memory system is coherent if

1 (Program order)

A read of location X on processor P directly following a write of location X on processor P with no other writes of location X by other processors in between returns the value written by P. Program order is preserved.

2 (Coherent view)

A read by a processor to location X that follows a write by another processor to location X returns the written value, provided the read and write are sufficiently separated in time and no other writes to location X occur in between. Written values are propagated.

3 (Write serialization)

Writes to the same location are serialized. This means that two writes to the same location by any processors are seen in the same order by all processors. All processors observe the same order of writes to location X.

SLIDE 30

Migration and Replication

Caches in a multiprocessor improve performance by supporting:

Migration: data that is used by one processor is moved to its cache.
Replication: data that is simultaneously read by multiple processors is replicated into each processor’s local cache.

To maintain coherent caches in the presence of migration and replication, hardware-based cache coherence protocols are used.
SLIDE 31

MESI: A Cache Coherence Protocol

States

Each cache line can be in one of four states:

Modified: the cache line exists only in the local cache and is dirty (modified).
Exclusive: the cache line exists only in the local cache but is unchanged.
Shared: the cache line may also exist in remote caches and is unchanged.
Invalid: the cache line is invalid.
SLIDE 32

MESI: A Cache Coherence Protocol

Transitions

State transitions are triggered by:

Local reads and writes
Reads and writes intercepted on the bus: snooping

Snooping-based protocols require a broadcast medium and hence do not scale very well. Alternatives include the so-called directory-based protocols. A simplified sketch of the state transitions follows.
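A much simplified sketch of the MESI transitions (my own illustration, not from the slides), tracking only the state of one line in one cache; a real protocol also issues bus transactions, write-backs, and invalidations. All names are mine.

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

/* New state after a local read; other_sharers tells whether some remote
   cache already holds the line when it is fetched on a miss. */
mesi_state_t on_local_read(mesi_state_t s, int other_sharers)
{
    if (s == INVALID)
        return other_sharers ? SHARED : EXCLUSIVE;   /* read miss */
    return s;                                        /* hit: state unchanged */
}

/* A local write always leaves the line Modified; from Invalid or Shared the
   protocol must first invalidate any remote copies. */
mesi_state_t on_local_write(mesi_state_t s)
{
    (void)s;
    return MODIFIED;
}

/* Snooped read by another processor: a Modified line is written back, and
   Modified/Exclusive lines drop to Shared. */
mesi_state_t on_snooped_read(mesi_state_t s)
{
    return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
}

/* Snooped write (or invalidation) by another processor invalidates our copy. */
mesi_state_t on_snooped_write(mesi_state_t s)
{
    (void)s;
    return INVALID;
}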

SLIDE 33

Part V Hardware Multithreading

SLIDE 34

Hardware Multithreading: Several threads on one core

Software multithreading:

◮ Purpose: time-sharing to emulate concurrent processing
◮ The processor is used by one thread at a time
◮ Switching threads is infrequent and expensive

Hardware multithreading:

◮ Purpose: share the functional units between threads
◮ The processor is typically used by multiple threads at a time
◮ Switching threads is either very cheap or free
◮ Hardware resources are tied to each thread
SLIDE 35

Approaches to Hardware Multithreading

Coarse-grained

◮ Run one thread until an expensive stall
◮ Switch to another thread with some overhead

Fine-grained

◮ Switch to a new thread on every clock cycle
◮ Switch with no extra cost

Simultaneous (SMT)

◮ Assumes dynamic pipeline scheduling
◮ Several threads in parallel on every clock
◮ Essentially no switch at all: threads run concurrently
SLIDE 36

Fine-grained / Coarse-grained / SMT

[Figure: issue slots over time for fine-grained MT, coarse-grained MT, and SMT, with instructions from threads A–D filling the slots]
SLIDE 37

Part VI Sample Architecture Types and Programming Models

SLIDE 38

Traditional Multicore

[Figure: traditional multicore; two CPU cores, each with registers and a private L1 cache, sharing an L2 cache and main memory]
SLIDE 39

Traditional Multicore

Characteristics

Homogeneous multicore (2, 4, 6, 8, . . . (?) cores)
Conventional (heavy) core design

◮ Pipelined
◮ Superscalar
◮ SIMD

Multilevel cache hierarchy
Caches shared to varying degrees
Coherent caches
Familiar programming models (processes/threads/...)
SLIDE 40

Traditional Multicore: Programming

Shared Memory

PThreads example: summing a vector

#include <pthread.h>

float sum = 0;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void parallel_sum(int np, int me, float A[], int n)
{
    /* Local computation: sum this thread's slice of A. */
    float localsum = 0;
    int i;
    for (i = (n * me) / np; i < (n * (me + 1)) / np; i++)
        localsum += A[i];

    /* Synchronized reduction into the shared sum. */
    pthread_mutex_lock(&sum_lock);
    sum += localsum;
    pthread_mutex_unlock(&sum_lock);
}
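A possible driver for the routine above (my addition, not on the slide): it spawns np threads that each run parallel_sum on their slice of the vector. The wrapper struct and function names are hypothetical.

struct sum_args { int np, me, n; float *A; };

static void *sum_thread(void *p)
{
    struct sum_args *a = p;
    parallel_sum(a->np, a->me, a->A, a->n);
    return NULL;
}

void sum_vector(int np, float A[], int n)
{
    pthread_t tid[np];
    struct sum_args args[np];
    for (int me = 0; me < np; me++) {
        args[me] = (struct sum_args){ np, me, n, A };
        pthread_create(&tid[me], NULL, sum_thread, &args[me]);
    }
    for (int me = 0; me < np; me++)
        pthread_join(tid[me], NULL);
    /* The shared variable sum now holds the total. */
}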

SLIDE 41

Cell Broadband Engine (CBE)

Characteristics

Heterogeneous multicore (9 cores)
One conventional processor core (PPE)

◮ Suitable for control code such as an OS
◮ Cached access to memory

Eight 128-bit SIMD processors (SPEs)

◮ Specialized for computations
◮ Small (256 KB) scratchpad memory local to each SPE
◮ DMA between local and global memory

Special programming model

SLIDE 42

Cell Broadband Engine (CBE)

[Figure: CBE block diagram; the PPE (with L1/L2 caches) and eight SPEs (each with a local store) connected to main memory, with DMA transfers between the local stores and memory]
SLIDE 43

CBE: Programming

float A[1024], B[1024];
uint64_t eaA, eaB;
uint32_t tag, i;

/* Start asynchronous loads of A and B from main memory into the local store. */
mfc_get(A, eaA, sizeof(A), tag, 0, 0);
mfc_get(B, eaB, sizeof(B), tag, 0, 0);

/* Wait for the loads of A and B to complete. */
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

/* Compute A = A + B locally. */
for (i = 0; i < 1024; i++)
    A[i] += B[i];

/* Start asynchronous store of A back to main memory. */
mfc_put(A, eaA, sizeof(A), tag, 0, 0);

SLIDE 44

GPU (CUDA)

Characteristics

Homogeneous multicore (128+ cores)
Multilevel processor design

◮ SP: Scalar processor
◮ SM: Streaming multiprocessor (8 SPs + scratchpad memory)

Global memory is shared by all SPs
Local scratchpad memory for each SM
Low clock frequency
High memory bandwidth
Special programming model (CUDA/OpenCL/...)
SLIDE 45

GPU

[Figure: GPU block diagram; 16 streaming multiprocessors (SMs), each containing 8 scalar processors (SPs) and a local memory, attached to device memory]
SLIDE 46

CUDA Programming Model

Code expressed at thread level
Several threads combine to implement an algorithm
Threads grouped hierarchically:

◮ kernel: any number of threads
◮ grid: any number of thread blocks
◮ thread block: 1–512 cooperating threads
◮ warp: 32 threads executing in SIMD fashion

Synchronization only within a thread block
Coordination between thread blocks via global memory
SLIDE 47

CUDA Programming Model

Example

__global__ void saxpy(int n, float alpha, float *x, float *y)
{
    /* Each thread computes one result. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

/* Invoke the kernel: spawn the threads in blocks of 256. */
int nblocks = (n + 255) / 256;
saxpy<<<nblocks, 256>>>(n, 2.0f, x, y);
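For completeness, a hedged sketch (my addition, not on the slide) of the host-side allocation and transfers that would normally surround the kernel launch; error checking is omitted and the function name is mine.

#include <cuda_runtime.h>

void run_saxpy(int n, float alpha, const float *x_host, float *y_host)
{
    float *x, *y;

    /* Allocate device memory and copy the input vectors to the GPU. */
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    cudaMemcpy(x, x_host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, y_host, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Launch one thread per element, in blocks of 256 threads. */
    int nblocks = (n + 255) / 256;
    saxpy<<<nblocks, 256>>>(n, alpha, x, y);

    /* Copy the result back and release the device memory. */
    cudaMemcpy(y_host, y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x);
    cudaFree(y);
}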

SLIDE 48

Part VII The Difficulty of Parallel Programming

SLIDE 49

Parallel Programming Difficulties

Relevant, correct, portable, ..., and fast
Large set of evolving programming models

◮ Risky to invest time and money in short-lived technology

Quality assurance more difficult
Portable and fast are often conflicting goals
Scalability
SLIDE 50

Scalability

Amdahl’s Law

Recall the serial fraction f of an algorithm
Informally: improved time = (time affected) / (improvement factor) + (time unaffected)
Formally: T_p = (1 − f) · T_1 / p + f · T_1
In the limit p → ∞: T_p → f · T_1, so the speedup T_1 / T_p is bounded by 1 / f
Suppose f = 5%: speedup < 20 no matter how many processors
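A small numerical check of the bound (my addition, not on the slide): the Amdahl speedup for f = 5% approaches, but never reaches, 1/f = 20 as p grows.

#include <stdio.h>

/* Amdahl's law: speedup on p processors for serial fraction f. */
static double speedup(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    const double f = 0.05;
    const int p_values[] = { 1, 4, 16, 64, 256, 1024 };
    for (int i = 0; i < 6; i++)
        printf("p = %4d: speedup = %.2f\n", p_values[i], speedup(f, p_values[i]));
    return 0;
}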

SLIDE 51

Scalability

Load Balance

Suppose p processors work in parallel
If each processor has the same amount of work: perfect speedup (p)
If one processor has twice the work of each of the others: the speedup is almost halved
SLIDE 52

Conclusions

Complex design space for multicores
Some trends are being reversed:

◮ Lower clock frequency
◮ Simpler cores
◮ Less emphasis on backwards compatibility
◮ Smaller caches

Reducing communication is crucial
Programming models are evolving
Performance improvements require software development
Exotic designs are commercially successful