Lecture notes for CS 433 - Chapter 4 11/7/2019

Chapter 5: Thread-Level Parallelism – Part 1
Sarita Adve

Outline
  - Introduction
  - What is a parallel or multiprocessor system?
  - Why parallel architecture?
  - Performance potential
  - Flynn classification
  - Communication models
  - Architectures
      - Centralized shared-memory
      - Distributed shared-memory
  - Parallel programming
  - Synchronization
  - Memory consistency models

What is a parallel or multiprocessor system?
  - Multiple processor units working together to solve the same problem
  - Key architectural issue: communication model

Why parallel architectures?
  - Absolute performance
  - Technology and architecture trends
      - Dennard scaling, ILP wall, Moore's law
      - Multicore chips
  - Connect multicores together for even more parallelism

Performance Potential
  - Amdahl's Law is pessimistic
  - Let s be the serial part
  - Let p be the part that can be parallelized n ways
  - Serial: SSPPPPPP (8 time units)
  - 6 processors: SS, then the six P units run at the same time, one per processor (3 time units)
  - Speedup = 8/3 = 2.67
  - T(n) = 1 / (s + p/n), where T(n) is the speedup on n processors and s + p = 1
  - As n → ∞, T(n) → 1/s
  - Pessimistic

Performance Potential (Cont.)
  Gustafson's Corollary
  - Amdahl's law holds if we run the same problem size on larger machines
  - But in practice, we run larger problems and "wait" the same time

Performance Potential (Cont.)
  Gustafson's Corollary (Cont.)
  - Assume for larger problem sizes
      - Serial time fixed (at s)
      - Parallel time proportional to problem size (truth more complicated)
  - Old serial: SSPPPPPP
  - 6 processors: SSPPPPPP on one processor, with PPPPPP on each of the other five at the same time
  - Hypothetical serial: SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP
  - Speedup = (8 + 5*6)/8 = 4.75
  - T'(n) = s + n*p; T'(∞) → ∞ !!!!
  - How does your algorithm "scale up"?

Flynn classification
  - Single-Instruction Single-Data (SISD)
  - Single-Instruction Multiple-Data (SIMD)
  - Multiple-Instruction Single-Data (MISD)
  - Multiple-Instruction Multiple-Data (MIMD)

Communication models
  - Shared-memory
  - Message passing
  - Data parallel

Communication Models: Shared-Memory
  [Figure: processors (P) connected through an interconnect to memory modules (M)]
  - Each node a processor that runs a process
  - One shared memory
  - Accessible by any processor
  - The same address on two different processors refers to the same datum
  - Therefore, write and read memory to
      - Store and recall data
      - Communicate, synchronize (coordinate)

Communication Models: Message Passing
  [Figure: processor-memory pairs (P, M) connected by an interconnect]
  - Each node a computer
      - Processor: runs its own program (like SM)
      - Memory: local to that node, unrelated to other memory
  - Add messages for internode communication; send and receive like mail

Communication Models: Data Parallel
  [Figure: processor-memory pairs (P, M) connected by an interconnect]
  - Virtual processor per datum
  - Write sequential programs with "conceptual PC" and let parallelism be within the data (e.g., matrices)
  - C = A + B
  - Typically SIMD architecture, but MIMD can be as effective

Architectures
  - All mechanisms can usually be synthesized by all hardware
  - Key: which communication model does hardware support best?
  - Virtually all small-scale systems, multicores are shared-memory

Which is Best Communication Model to Support?
  Shared-memory
  - Used in small-scale systems
  - Easier to program for dynamic data structures
  - Lower overhead communication for small data
  - Implicit movement of data with caching
  - Hard to build?
  Message-passing
  - Communication explicit, harder to program?
  - Larger overheads in communication, OS intervention?
  - Easier to build?

Shared-Memory Architecture
  The model
  [Figure: PROC, PROC, PROC connected through an INTERCONNECT to MEMORY]
  - For now, assume interconnect is a bus: centralized architecture

Centralized Shared-Memory Architecture
  [Figure: PROC, PROC, PROC on a BUS with MEMORY]

Centralized Shared-Memory Architecture (Cont.)
  - For higher bandwidth (throughput)
  - For lower latency
  - Problem?

Cache Coherence Problem
  [Figure: PROC 1, PROC 2, ... PROC n, each with a CACHE, on a BUS with MEMORY; copies of A sit in two caches and in memory]

Cache Coherence Solutions
  - Snooping
  [Figure: PROC 1, PROC 2, ... PROC n with CACHEs on a BUS with MEMORY]
  - Problem with centralized architecture

Distributed Shared-Memory (DSM) Architecture
  - Use a higher bandwidth interconnection network
  [Figure: PROC 1, PROC 2, ... PROC n, each with a CACHE, connected by a GENERAL INTERCONNECT to MEMORY modules]
  - Uniform memory access architecture (UMA)

Distributed Shared-Memory (DSM) - Cont.
  - For lower latency: Non-Uniform Memory Access architecture (NUMA)

Non-Bus Interconnection Networks
  - Example interconnection networks

Distributed Shared-Memory - Coherence Problem
  - Directory scheme
  [Figure: nodes of PROC, MEM, and CACHE connected by a SWITCH/NETWORK]
  - Level of indirection!

Parallel Programming
  Example
  - Add two matrices: C = A + B
  Sequential Program

    int main(int argc, char *argv[])
    {
        int i, j;

        Read(A);
        Read(B);
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
        Print(C);
    }

Parallel Program Example (Cont.)

The Parallel Programming Process

Synchronization
  - Communication: exchange data
  - Synchronization: exchange data to order events
      - Mutual exclusion or atomicity
      - Event ordering or producer/consumer
          - Point to point: flags
          - Global: barriers

Mutual Exclusion
  Example
  - Each processor needs to occasionally update a counter

    Processor 1                Processor 2
    Load reg1, Counter         Load reg2, Counter
    reg1 = reg1 + tmp1         reg2 = reg2 + tmp2
    Store Counter, reg1        Store Counter, reg2

  - Problem?

Mutual Exclusion Primitives
  - Hardware instructions
  - Test&Set
      - Atomically tests for 0 and sets to 1
      - Unset is simply a store of 0

    while (Test&Set(L) != 0) {;}
    Critical Section
    Unset(L)

Mutual Exclusion Primitives – Alternative?
  - Test&Test&Set

Mutual Exclusion Primitives – Fetch&Add

    Fetch&Add(var, data)
    { /* atomic action */
      temp = var
      var = temp + data
    }
    return temp

  - E.g., let X = 57
      P1: a = Fetch&Add(X,3)
      P2: b = Fetch&Add(X,5)
  - If P1 before P2, ?
  - If P2 before P1, ?
  - If P1, P2 concurrent ?

Global Event Ordering – Barriers
  Example
  - All processors produce some data
  - Want to tell all processors that it is ready
  - In next phase, all processors consume data produced previously
  - Use barriers

Point to Point Event Ordering
  Example
  - Producer wants to indicate to consumer that data is ready

    Processor 1        Processor 2
    A[1] = …           … = A[1]
    A[2] = …           … = A[2]
    . . .              . . .
    A[n] = …           … = A[n]
