
Slide 1

COMP 590-154: Computer Architecture

Shared-Memory Multi-Processors

Slide 2

Shared-Memory Multiprocessors

  • Multiple threads use shared memory (address space)

– “SysV Shared Memory” or “Threads” in software

  • Communication implicit via loads and stores

– Opposite of explicit message-passing multiprocessors

  • Theoretical foundation: PRAM model

[Figure: processors P1 to P4 connected to a single shared Memory System]
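To make "communication implicit via loads and stores" concrete, here is a minimal sketch (not from the slides) using POSIX threads: a worker thread passes a result to main simply by storing into a shared variable, and pthread_join provides the ordering. The names shared_result and worker are illustrative.

    /* Minimal sketch: threads communicate through the shared address space.
       The worker "sends" its result with an ordinary store; main "receives"
       it with an ordinary load after the join.  Compile with: gcc -pthread */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_result;                    /* visible to both threads */

    static void *worker(void *arg) {
        (void)arg;
        shared_result = 42;                      /* ordinary store */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);                   /* join orders the store before the load below */
        printf("main observed %d\n", shared_result);   /* ordinary load */
        return 0;
    }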

Slide 3

Why Shared Memory?

  • Pluses

– App sees multitasking uniprocessor
– OS needs only evolutionary extensions
– Communication happens without OS

  • Minuses

– Synchronization is complex
– Communication is implicit (hard to optimize)
– Hard to implement (in hardware)

  • Result

– SMPs and CMPs are most successful machines to date
– First with multi-billion-dollar markets

Slide 4

Paired vs. Separate Processor/Memory?

  • Separate CPU/memory

– Uniform memory access (UMA)

  • Equal latency to memory

– Low peak performance

  • Paired CPU/memory

– Non-uniform memory access (NUMA)

  • Faster local memory
  • Data placement matters

– High peak performance

[Figure: two organizations of CPU($) and Mem nodes: separate CPUs and memories on a shared interconnect (UMA) vs. paired CPU($)/Mem nodes each attached to a router R (NUMA)]

Slide 5

Shared vs. Point-to-Point Networks

  • Shared network

– Example: bus
– Low latency
– Low bandwidth

  • Doesn’t scale >~16 cores

– Simple cache coherence

  • Point-to-point network:

– Example: mesh, ring
– High latency (many “hops”)
– Higher bandwidth

  • Scales to 1000s of cores

– Complex cache coherence

[Figure: CPU($)/Mem nodes attached to a single shared bus vs. CPU($)/Mem nodes each attached to a router R in a point-to-point network]

Slide 6

Organizing Point-To-Point Networks

  • Network topology: organization of network

– Trade off perf. (connectivity, latency, bandwidth) vs. cost

  • Router chips

– Networks w/ separate router chips are indirect
– Networks w/ processor/memory/router in chip are direct

  • Fewer components, “Glueless MP”

[Figure: an indirect network with separate router chips R between CPU($)/Mem nodes vs. a direct network with a router R integrated into each CPU($)/Mem node]

Slide 7

Issues for Shared Memory Systems

  • Two big ones

– Cache coherence
– Memory consistency model

  • Closely related
  • Often confused
Slide 8

Cache Coherence: The Problem (1/2)

  • Variable A initially has value 0
  • P1 stores value 1 into A
  • P2 loads A from memory and sees old value 0

[Diagram: P1 and P2 each have an L1 cache on a shared bus to main memory, which holds A: 0. At t1 P1 stores A = 1 into its L1 (A: 0 → 1); at t2 P2 loads A, fetches it from memory, and sees the stale value 0.]

Need to do something to keep P2’s cache coherent

Slide 9

Cache Coherence: The Problem (2/2)

  • P1 and P2 have variable A (value 0) in their caches
  • P1 stores value 1 into A
  • P2 loads A from its cache and sees old value 0

[Diagram: P1 and P2 each hold A: 0 in their L1 caches; main memory also holds A: 0. At t1 P1 stores A = 1 into its own L1 (A: 0 → 1); at t2 P2 loads A, hits in its own L1, and still sees the stale value 0.]

Need to do something to keep P2’s cache coherent

Slide 10

Approaches to Cache Coherence

  • Software-based solutions

– Mechanisms:

  • Mark cache blocks/memory pages as cacheable/non-cacheable
  • Add “Flush” and “Invalidate” instructions

– Could be done by compiler or run-time system
– Difficult to get perfect (e.g., what about memory aliasing?)

  • Hardware solutions are far more common

– System ensures everyone always sees the latest value
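As a hedged sketch of what the software-based mechanisms above could look like, the fragment below uses the real x86 _mm_clflush and _mm_mfence intrinsics; the non-coherent-system scenario and the variable names are hypothetical, purely to illustrate explicit "flush" before publishing data.

    /* Sketch: software-managed coherence on a hypothetical non-coherent system.
       The producer flushes the written cache line so a consumer's next access
       misses and fetches the up-to-date value from memory. */
    #include <immintrin.h>

    volatile int shared_buf;     /* data being published            */
    volatile int ready_flag;     /* set once shared_buf is valid    */

    void publish(int value) {
        shared_buf = value;                        /* write into own cache          */
        _mm_clflush((const void *)&shared_buf);    /* write back + invalidate line  */
        _mm_mfence();                              /* order the flush before the flag */
        ready_flag = 1;
        _mm_clflush((const void *)&ready_flag);    /* make the flag visible too     */
    }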

Slide 11

Coherence with Write-through Caches

  • Allows multiple readers, but writes through to bus

– Requires Write-through, no-write-allocate cache

  • All caches must monitor (aka “snoop”) all bus traffic

– Simple state machine for each cache frame

[Diagram: write-through, no-write-allocate caches on a shared bus; P1 and P2 both hold A [V]: 0 and memory holds A: 0. At t1 P1 stores A = 1 (its copy becomes A [V]: 1); at t2 the write appears on the bus as BusWr A = 1 and memory is updated to 1; at t3 P2 snoops the BusWr and invalidates its copy (A [V → I]).]

Slide 12

Valid-Invalid Snooping Protocol

  • Processor Actions

– Ld, St, BusRd, BusWr

  • Bus Messages

– BusRd, BusWr

  • Track 1 bit per cache frame

– Valid/Invalid

[State diagram:
Invalid → Valid on Load / BusRd
Invalid → Invalid on Store / BusWr
Valid → Valid on Load / --
Valid → Valid on Store / BusWr
Valid → Invalid on observed BusWr / --]
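As a sketch (not from the slides) of the per-frame controller this state machine describes, the standalone C fragment below separates processor-side events from snooped bus events; issue_bus() stands in for the actual bus interface.

    /* Sketch of the per-frame Valid/Invalid snooping controller
       (write-through, no-write-allocate cache). */
    typedef enum { INVALID, VALID } vi_state_t;
    typedef enum { BUS_RD, BUS_WR } bus_msg_t;

    void issue_bus(bus_msg_t m, unsigned addr);   /* assumed bus interface */

    /* Events from this frame's own processor. */
    vi_state_t vi_on_cpu(vi_state_t s, int is_store, unsigned addr) {
        if (is_store) {                 /* Store: always write through      */
            issue_bus(BUS_WR, addr);    /* ...no allocate, state unchanged  */
            return s;
        }
        if (s == INVALID) {             /* Load miss: fetch the block       */
            issue_bus(BUS_RD, addr);
            return VALID;
        }
        return VALID;                   /* Load hit: no bus traffic         */
    }

    /* Events snooped from other processors on the bus. */
    vi_state_t vi_on_snoop(vi_state_t s, bus_msg_t m) {
        if (m == BUS_WR)                /* another core wrote this block    */
            return INVALID;             /* ...drop our stale copy           */
        return s;                       /* BusRd from others: no effect     */
    }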

Slide 13

Supporting Write-Back Caches

  • Write-back caches are good

– Drastically reduce bus write bandwidth

  • Add notion of “ownership” to Valid-Invalid

– When the “owner” has the only replica of a cache block

  • Update it freely

– Multiple readers are ok

  • Not allowed to write without gaining ownership

– On a read, system must check if there is an owner

  • If yes, take away ownership
Slide 14

Modified-Shared-Invalid (MSI) States

  • Processor Actions

– Load, Store, Evict

  • Bus Messages

– BusRd, BusRdX, BusInv, BusWB, BusReply (shown separately here for simplicity; some messages can be combined)

  • Track 3 states per cache frame

– Invalid: cache does not have a copy
– Shared: cache has a read-only copy; clean

  • Clean: memory (or later caches) is up to date

– Modified: cache has the only valid copy; writable; dirty

  • Dirty: memory (or later caches) is out of date
Slide 15

Simple MSI Protocol (1/9)

[Diagram, first transition: Invalid → Shared on Load / BusRd.
Example: memory holds A = 0 and both caches start with A invalid; a core executes 1: Load A, issues 2: BusRd A on the bus, memory answers 3: BusReply A, and its copy goes I → S with value 0.]

Slide 16

Simple MSI Protocol (2/9)

[Diagram, added transitions: Shared → Shared on Load / --; Shared → Shared on BusRd / [BusReply] (a sharer may optionally supply the data).
Example: with one core already holding A in S, the other core executes 1: Load A, issues 2: BusRd A, receives 3: BusReply A, and also goes I → S; both caches now share A = 0.]

Slide 17

Simple MSI Protocol (3/9)

[Diagram, added transition: Shared → Invalid on Evict / -- (silent eviction of a clean copy).
Example: one of the two sharers evicts A; no bus message is needed, and its copy goes S → I while the other sharer keeps A [S]: 0.]

Slide 18

Simple MSI Protocol (4/9)

[Diagram, added transitions: Invalid → Modified on Store / BusRdX; Modified → Modified on Load, Store / --; Shared → Invalid on BusRdX / [BusReply].
Example: a core with A invalid executes 1: Store A; it issues 2: BusRdX A, receives 3: BusReply A, and goes I → M with the new value 1, while the remaining sharer's copy goes S → I.]

Slide 19

Simple MSI Protocol (5/9)

[Diagram, added transition: Modified → Shared on BusRd / BusReply.
Example: the owner holds A in M with value 1; the other core executes 1: Load A and issues 2: BusRd A. The owner supplies 3: BusReply A and downgrades M → S, the requester goes I → S with value 1, and at 4 memory snarfs the reply off the bus, updating A from 0 to 1.]

Slide 20

Simple MSI Protocol (6/9)

[Diagram, added transitions: Shared → Modified on Store / BusInv (aka “Upgrade”); Shared → Invalid on BusRdX, BusInv / [BusReply].
Example: both cores hold A in S with value 1. One of them executes 1: Store A; since it already has the data it issues only 2: BusInv A (the upgrade), its copy goes S → M with the new value 2, and the other copy goes S → I.]

Slide 21

Simple MSI Protocol (7/9)

[Diagram, added transition: Modified → Invalid on BusRdX / BusReply.
Example: the core whose copy was just invalidated executes 1: Store A; it issues 2: BusRdX A, the owner supplies 3: BusReply A and goes M → I, and the requester goes I → M with the new value 3.]

Slide 22

Simple MSI Protocol (8/9)

[Diagram, added transition: Modified → Invalid on Evict / BusWB.
Example: the owner holds A in M with value 3 and evicts it; at 1: Evict A it issues 2: BusWB A, memory is updated from 1 to 3, and the copy goes M → I.]

Slide 23

Simple MSI Protocol (9/9)

Cache Actions: Load, Store, Evict
Bus Actions: BusRd, BusRdX, BusInv, BusWB, BusReply

[Complete state diagram:
Invalid → Shared on Load / BusRd
Invalid → Modified on Store / BusRdX
Shared → Shared on Load / --
Shared → Modified on Store / BusInv
Shared → Invalid on Evict / --
Shared → Shared on BusRd / [BusReply]
Shared → Invalid on BusRdX, BusInv / [BusReply]
Modified → Modified on Load, Store / --
Modified → Shared on BusRd / BusReply
Modified → Invalid on BusRdX / BusReply
Modified → Invalid on Evict / BusWB]

Usable coherence protocol
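The complete protocol above can be written down directly as a pair of transition functions, one for the processor side and one for the snoop side. The standalone C sketch below mirrors the diagram; issue_bus() stands in for the bus interface, and it is illustrative only (no data movement, no transient states).

    /* Sketch of the processor-side and snoop-side MSI transitions. */
    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum { LOAD, STORE, EVICT } cpu_event_t;
    typedef enum { BUS_RD, BUS_RDX, BUS_INV, BUS_WB, BUS_REPLY } bus_event_t;

    void issue_bus(bus_event_t m, unsigned addr);   /* assumed bus interface */

    msi_state_t msi_cpu(msi_state_t s, cpu_event_t e, unsigned addr) {
        switch (s) {
        case MSI_I:
            if (e == LOAD)  { issue_bus(BUS_RD,  addr); return MSI_S; }
            if (e == STORE) { issue_bus(BUS_RDX, addr); return MSI_M; }
            return MSI_I;                       /* Evict of an invalid frame   */
        case MSI_S:
            if (e == STORE) { issue_bus(BUS_INV, addr); return MSI_M; } /* upgrade */
            if (e == EVICT) return MSI_I;       /* clean copy: silent eviction */
            return MSI_S;                       /* Load hit                    */
        case MSI_M:
            if (e == EVICT) { issue_bus(BUS_WB, addr); return MSI_I; }  /* dirty: write back */
            return MSI_M;                       /* Load/Store hit              */
        }
        return s;
    }

    msi_state_t msi_snoop(msi_state_t s, bus_event_t e, unsigned addr) {
        if (s == MSI_M && e == BUS_RD)  { issue_bus(BUS_REPLY, addr); return MSI_S; }
        if (s == MSI_M && e == BUS_RDX) { issue_bus(BUS_REPLY, addr); return MSI_I; }
        if (s == MSI_S && (e == BUS_RDX || e == BUS_INV))
            return MSI_I;                       /* a sharer may optionally BusReply */
        return s;                               /* all other snoops: no change */
    }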

Slide 24

Scalable Cache Coherence

  • Part I: bus bandwidth

– Replace non-scalable bandwidth substrate (bus) …with scalable-bandwidth one (e.g., mesh)

  • Part II: processor snooping bandwidth

– Most snoops result in no action
– Replace non-scalable broadcast protocol (spam everyone) …with scalable directory protocol (spam cores that care)

Requires a “directory” to keep track of “sharers”

Slide 25

Directory Coherence Protocols

  • Extend memory to track caching information
  • For each physical cache line, a home directory tracks:

– Owner: core that has a dirty copy (i.e., M state)
– Sharers: cores that have clean copies (i.e., S state)

  • Cores send coherence events to home directory

– Home directory only sends events to cores that care
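A minimal sketch of the per-line directory state this implies is shown below; the 64-core sharer bitmask and the read-handler structure are illustrative assumptions, not from the slides.

    /* Sketch of the per-line state a home directory might track. */
    #include <stdint.h>

    typedef enum { DIR_INVALID, DIR_SHARED, DIR_MODIFIED } dir_state_t;

    typedef struct {
        dir_state_t state;     /* I, S (clean copies exist), or M (dirty)  */
        uint8_t     owner;     /* core holding the dirty copy, valid in M  */
        uint64_t    sharers;   /* bit i set => core i holds a clean copy   */
    } dir_entry_t;

    /* On a read request from core `req`, the home either replies directly
       (I/S) or recalls/forwards from the owner (M), then records `req`. */
    void dir_handle_read(dir_entry_t *e, int req) {
        if (e->state == DIR_MODIFIED) {
            /* recall or forward from e->owner; the data becomes clean */
            e->sharers = (1ull << e->owner) | (1ull << req);
        } else {
            e->sharers |= (1ull << req);       /* just add a sharer */
        }
        e->state = DIR_SHARED;
        /* finally, reply with the data to core `req` */
    }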

Slide 26

Read Transaction

  • L has a cache miss on a load instruction

[Diagram: L sends 1: Read Req to home node H; H responds with 2: Read Reply]

Slide 27

4-hop Read Transaction

  • L has a cache miss on a load instruction

– Block was previously in modified state at R

[Diagram: L sends 1: Read Req to home H; the directory at H has State: M, Owner: R, so H sends 2: Recall Req to R; R returns 3: Recall Reply to H; H then sends 4: Read Reply to L]

Slide 28

3-hop Read Transaction

  • L has a cache miss on a load instruction

– Block was previously in modified state at R

[Diagram: L sends 1: Read Req to home H; the directory at H has State: M, Owner: R, so H sends 2: Fwd’d Read Req to R; R then sends 3: Read Reply directly to L and 3: Fwd’d Read Ack to H]

Slide 29

An Example Race: Writeback & Read

  • L has dirty copy, wants to write back to H
  • R concurrently sends a read to H

[Diagram: the directory at H has State: M, Owner: L. 1: L sends a WB Req to H; 2: R concurrently sends a Read Req to H; 3: H, still believing L is the owner, sends a Fwd’d Read Req to L; 4: the writeback and the forwarded read race at L (no need to ack); 5: H sends a Read Reply to R; 6: the race resolves with final state S (again no need to ack)]

Races require complex intermediate states

Slide 30

Basic Operation: Read

[Message sequence: L misses on Read A and sends Read A to the Directory; the Directory responds with Fill A and records A: Shared, #1]

Typical way to reason about directories

Slide 31

Basic Operation: Write

[Message sequence: L misses on Read A, sends Read A to the Directory, and receives Fill A; the Directory records A: Shared, #1. R then sends ReadExclusive A to the Directory; the Directory sends Invalidate A to L, L responds with InvAck A, the Directory sends Fill A to R, and the entry becomes A: Mod., #2]
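As a hedged sketch of how the home might serve the ReadExclusive in this sequence, the fragment below reuses the dir_entry_t structure sketched on the Directory Coherence Protocols slide; send_invalidate() and wait_for_inv_acks() are hypothetical helpers standing in for the real interconnect interface.

    void send_invalidate(int core, unsigned addr);   /* Invalidate A -> core  */
    void wait_for_inv_acks(void);                    /* collect all InvAck A  */

    void dir_handle_write(dir_entry_t *e, int req, unsigned addr) {
        for (int i = 0; i < 64; i++)                 /* invalidate every sharer */
            if (((e->sharers >> i) & 1) && i != req)
                send_invalidate(i, addr);
        wait_for_inv_acks();
        e->sharers = (1ull << req);                  /* writer holds the only copy   */
        e->owner   = (uint8_t)req;
        e->state   = DIR_MODIFIED;                   /* then Fill A -> req (A: Mod.) */
    }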

Slide 32

Coherence vs. Consistency

  • Coherence concerns only one memory location
  • Consistency concerns ordering for all locations
  • A Memory System is Coherent if

– Can serialize all operations to that location

  • Operations performed by any core appear in program order

– Read returns value written by last store to that location

  • A Memory System is Consistent if

– It follows the rules of its Memory Model

  • Operations on memory locations appear in some defined order
Slide 33

Why Coherence != Consistency

/* initial A = B = flag = 0 */
P1                  P2
A = 1;              while (flag == 0); /* spin */
B = 1;              print A;
flag = 1;           print B;

  • Intuition says we see “1” printed twice (A,B)
  • Coherence doesn’t say anything

– Different memory locations

  • Uniprocessor ordering (LSQ) won’t help

Consistency defines what is “correct” behavior

Slide 34

Sequential Consistency (SC)

[Diagram: processors P1, P2, P3 each issue memory ops in program order; a switch, randomly set after each memory op, connects one processor at a time to a single Memory]

Defines Single Sequential Order Among All Ops.

Slide 35

Sufficient Conditions for SC

“A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

– Lamport, 1979
  • Every proc. issues memory ops in program order
  • Memory ops happen (start and end) atomically

– On Store, wait to commit before issuing next memory op
– On Load, wait to write back before issuing next op

Easy to reason about, very slow (without ugly tricks)

Slide 36

Mutual Exclusion Example

  • Mutually exclusive access to a critical region

– Works as advertised under Sequential Consistency
– Fails if P1 and P2 see different Load/Store order

  • OoO allows P1 to read B before writing (committing) A

P1                                  P2
lockA: A = 1;                       lockB: B = 1;
       if (B != 0)                         if (A != 0)
          { A = 0; goto lockA; }              { B = 0; goto lockB; }
       /* critical section */              /* critical section */
       A = 0;                               B = 0;

Slide 37

Problems with SC Memory Model

  • Difficult to implement efficiently in hardware

– Straight-forward implementations:

  • No concurrency among memory accesses
  • Strict ordering of memory accesses at each node
  • Essentially precludes out-of-order CPUs
  • Unnecessarily restrictive

– Most parallel programs won’t notice out-of-order accesses

  • Conflicts with latency hiding techniques
Slide 38

Mutex Example w/ Store Buffer

P1                                  P2
lockA: A = 1;                       lockB: B = 1;
       if (B != 0)                         if (A != 0)
          { A = 0; goto lockA; }              { B = 0; goto lockB; }
       /* critical section */              /* critical section */
       A = 0;                               B = 0;

[Diagram: P1 and P2 with store buffers on a shared bus; memory holds A: 0, B: 0. P1’s Read B reaches the bus at t1 and P2’s Read A at t2, while their Write A and Write B sit in the store buffers until t3 and t4. Both reads return 0, so both cores enter the critical section.]

Does not work
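One way to make this example work on a machine with store buffers is to add an explicit fence that forces the Write → Read order the hardware no longer provides. The sketch below (not from the slides) shows P1's side in C11 atomics; a seq_cst fence between the store to A and the load of B drains the store buffer and restores mutual exclusion.

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;            /* same roles as in the slide example */

    void p1_lock(void) {
        for (;;) {
            atomic_store_explicit(&A, 1, memory_order_relaxed);   /* Write A      */
            atomic_thread_fence(memory_order_seq_cst);            /* force W -> R */
            if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
                break;                                            /* lock acquired */
            atomic_store_explicit(&A, 0, memory_order_relaxed);   /* back off, retry (goto lockA) */
        }
        /* critical section */
        atomic_store_explicit(&A, 0, memory_order_release);       /* release */
    }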

Slide 39

Relaxed Consistency Models

  • Sequential Consistency (SC):

– R → W, R → R, W → R, W → W

  • Total Store Ordering (TSO) relaxes W → R

– R → W, R → R, W → W

  • Partial Store Ordering relaxes W → W (coalescing WB)

– R → W, R → R

  • Weak Ordering or Release Consistency (RC)

– All ordering explicitly declared

  • Use fences to define boundaries
  • Use acquire and release to force flushing of values

X → Y: X must complete before Y
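Under the weaker models, the required orderings are declared explicitly. As a hedged illustration in C11 atomics (the flag/data names echo the earlier coherence-vs-consistency example), a release/acquire pair guarantees that a consumer which observes flag == 1 also observes data == 1.

    #include <stdatomic.h>

    int data = 0;                  /* ordinary location        */
    atomic_int flag = 0;           /* synchronization location */

    void producer(void) {
        data = 1;                                               /* W */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* release: prior writes ordered before it */
    }

    void consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                                   /* acquire: later reads ordered after it */
        /* data == 1 is guaranteed to be visible here */
    }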

Slide 40

Atomic Operations & Synchronization

  • Atomic operations perform multiple actions together

– Each of these can implement the others

  • Compare-and-Swap (CAS)

– Compare memory value to arg1, write arg2 on match

  • Test-and-Set

– Overwrite memory value with arg1 and return old value

  • Fetch-and-Increment

– Increment value in memory and return the old value

  • Load-Linked/Store-Conditional (LL/SC)

– Two separate operations, but the Store succeeds only if the value is unchanged since the Load-Linked
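As a hedged sketch of how these primitives are exposed and used in practice, the C11 atomics below implement a test-and-set spinlock, a fetch-and-increment ticket counter, and a compare-and-swap wrapper; the function names are illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag lock   = ATOMIC_FLAG_INIT;   /* test-and-set target        */
    atomic_int  ticket = 0;                  /* fetch-and-increment target */

    void spin_lock(void) {
        /* Test-and-Set: atomically overwrite with 1, observe the old value */
        while (atomic_flag_test_and_set(&lock))
            ;                                /* old value was 1: held by someone else */
    }

    void spin_unlock(void) {
        atomic_flag_clear(&lock);
    }

    int next_ticket(void) {
        /* Fetch-and-Increment: returns the pre-increment value */
        return atomic_fetch_add(&ticket, 1);
    }

    bool try_swap(atomic_int *loc, int expected, int desired) {
        /* Compare-and-Swap: write `desired` only if *loc still equals `expected` */
        return atomic_compare_exchange_strong(loc, &expected, desired);
    }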