Multiprocessors and Multithreading, Jason Mars (Sunday, March 3, 13). PowerPoint presentation transcript.


SLIDE 1

Multiprocessors and Multithreading

Jason Mars

Sunday, March 3, 13

SLIDE 2

Parallel Architectures for Executing Multiple Threads

SLIDES 3-5

  • Multiprocessor – multiple CPUs coupled tightly enough to cooperate on a single problem.
  • Multithreaded processors (e.g., simultaneous multithreading) – a single CPU core that can execute multiple threads simultaneously.
  • Multicore processors – a multiprocessor where the CPU cores coexist on a single processor chip.

SLIDE 6

Multiprocessors

  • Not that long ago, multiprocessors were expensive, exotic machines: special-purpose engines built to solve hard problems.
  • Now they are pervasive.

[Figure: three processors, each with a private cache, on a single bus shared with memory and I/O]

SLIDE 7

Classifying Multiprocessors

  • Flynn Taxonomy
  • Interconnection Network
  • Memory Topology
  • Programming Model

SLIDE 8

Flynn Taxonomy

  • SISD (Single Instruction, Single Data)
    • Uniprocessors
  • SIMD (Single Instruction, Multiple Data)
    • Examples: Illiac-IV, CM-2, Nvidia GPUs, etc.
    • Simple programming model
    • Low overhead
  • MIMD (Multiple Instruction, Multiple Data)
    • Examples: many; nearly all modern multiprocessors and multicores
    • Flexible
    • Built from off-the-shelf microprocessors or microprocessor cores
  • MISD (Multiple Instruction, Single Data)
    • ???

SLIDE 9

Interconnection Networks

  • Bus
  • Network
  • pros/cons?

[Figure: three processors, each with a private cache, on a single bus shared with memory and I/O]

SLIDE 10

Memory Topology

  • UMA (Uniform Memory Access)
  • NUMA (Non-uniform Memory Access)
  • pros/cons?

[Figures: UMA, several CPUs sharing one memory over a single bus; NUMA, each processor with its own local memory module, reaching remote memory over a network]

SLIDE 11

Programming Model

  • Shared Memory -- every processor can name every address location.
  • Message Passing -- each processor can name only its local memory. Communication is through explicit messages.
  • pros/cons?

[Figure: distributed-memory machine, each processor with a cache and local memory, connected by a network]

SLIDE 12

find the max of 100,000 integers on 10 processors.

SLIDE 13

Parallel Programming

  • Shared-memory programming requires synchronization to provide mutual exclusion and prevent race conditions:
    • locks (semaphores)
    • barriers

Processor A: index = i++;        Processor B: index = i++;

i = 47

Each i++ executes as load i; inc i; store i. Slides 14-17 animate an interleaving of the two sequences in which both processors load 47, both store 48, and one increment is lost.

SLIDE 18

But...

  • That ignores the existence of caches.
  • How do caches complicate the problem of keeping data consistent between processors?

SLIDE 19

Multiprocessor Caches (Shared Memory)

  • the problem -- cache coherency
  • the solution?

[Figure: three processors with private caches on a single bus to memory and I/O; two of the caches hold copies of i]

Build (slides 20-22): one processor executes inc i on its cached copy; another processor's load i can then return a stale value of i.

SLIDE 23

What Does Coherence Mean?

  • Informally:
    • Any read must return the most recent write.
    • Too strict, and very difficult to implement.
  • Better:
    • A processor sees its own writes to a location in the correct order.
    • Any write must eventually be seen by a read.
    • All writes are seen in order ("serialization"): writes to the same location are seen in the same order by all processors.
  • Without these guarantees, synchronization doesn't work.

SLIDE 24

Solutions


  • Snooping Solution (Snoopy Bus):
    • Send all requests for unknown data to all processors.
    • Processors snoop to see if they have a copy and respond accordingly.
    • Requires "broadcast", since the caching information is at the processors.
    • Works well with a bus (a natural broadcast medium).
    • Dominates for small-scale machines (most of the market).

  • Directory-Based Schemes:
    • Keep track of what is being shared in one centralized place (for each address) => the directory.
    • Distributed memory => distributed directory (avoids bottlenecks).
    • Send point-to-point requests to processors (to invalidate, etc.).
    • Scales better than snooping for large multiprocessors.

SLIDE 27

Implementing Coherence Protocols

  • How do you find the most up-to-date copy of the desired data?
  • Snooping protocols
  • Directory protocols

[Figure: snooping caches, each with duplicate snoop tags beside the cache tags and data, on a single bus with memory and I/O]


Write-Update vs. Write-Invalidate: on a write, either broadcast the new value to the other caches or invalidate their copies.

SLIDE 29

Parallel Architectures for Executing Multiple Threads

  • Multiprocessor – multiple CPUs coupled tightly enough to cooperate on a single problem.
  • Multithreaded processors (e.g., simultaneous multithreading) – a single CPU core that can execute multiple threads simultaneously.
  • Multicore processors – a multiprocessor where the CPU cores coexist on a single processor chip.

SLIDE 30

Dean Tullsen

Simultaneous Multithreading

(A Few of Dean Tullsen’s 1996 Thesis Slides)

SLIDE 31

Hardware Multithreading

[Figure, built up over slides 31-35: a conventional processor has a single PC and register file feeding the CPU's instruction stream; a hardware-multithreaded processor replicates the PC and register file per hardware thread (four sets shown) while sharing the rest of the CPU]

SLIDE 36

Superscalar (vs. Superpipelined)

  • Superscalar: multiple instructions in the same stage; same clock rate as scalar.
  • Superpipelined: more total stages; faster clock rate.

SLIDE 37

Superscalar Execution

[Figure, built up over slides 37-39: a grid of issue slots vs. time in processor cycles; completely empty cycles are vertical waste, partially filled cycles are horizontal waste]

SLIDE 40

Superscalar Execution with Fine-Grain Multithreading

[Figure: issue slots vs. time; each cycle issues from one thread (Thread 1, 2, or 3 in turn), eliminating vertical waste but not horizontal waste]

SLIDE 41

Simultaneous Multithreading

[Figure: issue slots vs. time; each cycle can issue from several threads (Threads 1-5) at once, attacking both vertical and horizontal waste]

SLIDE 42

SMT Performance

[Chart: throughput in instructions per cycle (axis 1.75 to 7.0) vs. number of threads (1 to 8), comparing simultaneous multithreading, fine-grain multithreading, and a conventional superscalar]

SLIDE 43

Parallel Architectures for Executing Multiple Threads

  • Multiprocessor – multiple CPUs coupled tightly enough to cooperate on a single problem.
  • Multithreaded processors (e.g., simultaneous multithreading) – a single CPU core that can execute multiple threads simultaneously.
  • Multicore processors – a multiprocessor where the CPU cores coexist on a single processor chip.

SLIDE 44

Multicore Processors (aka Chip Multiprocessors)

  • Multiple cores on the same die; they may or may not share the L2 or L3 cache.
  • Intel and AMD both have quad-core processors. Sun's Niagara T2 is 8 cores x 8 threads (64 contexts!).
  • Everyone's roadmap seems to be increasingly multicore.

[Figure: several CPU cores on a single chip]

SLIDE 45

The Latest Processors

  • Tegra 3 (5 cores): multicore
  • Intel Nehalem (4 cores): multicore + SMT

SLIDE 46

Nehalem

[Figure, built up over slides 46-50: the Nehalem core pipeline, highlighting the Fetch, Decode, Execute, and Mem/WB stages in turn]

SLIDE 51

CSE 141 Dean Tullsen

SLIDE 54

Nehalem in a Nutshell

  • Up to 8 cores (i7: 4 cores)
  • 2 SMT threads per core
  • 20+ stage pipeline
  • x86 instructions translated to RISC-like uops
  • Superscalar: 4 "instructions" (uops) per cycle (more with fusing)
  • Caches (i7):
    • 32 KB 4-way set-associative I-cache per core
    • 32 KB 8-way set-associative D-cache per core
    • 256 KB unified 8-way set-associative L2 cache per core
    • 8 MB shared 16-way set-associative L3 cache

SLIDE 55

Key Points


  • Network vs. Bus
  • Message Passing vs. Shared Memory
  • Shared memory is more intuitive, but it creates problems for both the programmer (memory consistency, requiring synchronization) and the architect (cache coherency).
  • Multithreading gives the illusion of multiprocessing (including, in many cases, the performance) with very little additional hardware.
  • When multiprocessing happens within a single die/processor, we call it a chip multiprocessor, or a multicore architecture.
