SHARED MEMORY SYSTEMS
Mahdi Nazm Bojnordi, Assistant Professor
School of Computing, University of Utah
CS/ECE 6810: Computer Architecture

Overview
- Announcement
  - Final exam: in-class, 10:30AM-12:30PM, Dec. 13th
- This lecture
  - Shared memory systems
  - Cache coherence with write-back policy
  - Memory consistency

Recall: Cache Coherence Problem
- Multiple copies of each cache block
  - In main memory and caches
- Multiple copies can become inconsistent when writes happen
  - Solution: propagate writes from one core to the others
[Figure: Core 1 to Core N, each with its own cache (Cache 1 to Cache N), sharing Main Memory]

Cache Coherence
- The key operation is an update/invalidate sent to all or a subset of the cores
  - Software-based management
    - Flush: write all of the dirty blocks to memory
    - Invalidate: make all of the cache blocks invalid
  - Hardware-based management
    - Update or invalidate other copies on every write
    - Send data to everyone, or only to the ones who have a copy
- Invalidation-based protocol is better. Why?

Snoopy Protocol
- Relies on a broadcast infrastructure among caches
  - For example, a shared bus
- Every cache monitors (snoops) the traffic on the shared medium to keep the states of its cache blocks up to date
[Figure: cores with private L1 caches, connected to LLC and Memory]

Simple Snooping Protocol
- Relies on a write-through, write no-allocate cache
- Multiple readers are allowed
  - Writes invalidate replicas
- Employs a simple state machine for each cache unit
[Figure: P1 and P2, each with a cache, connected by a bus to memory holding A:0]

Simple Snooping State Machine
- Every node updates its one-bit valid flag using a simple finite state machine (FSM)
- Processor actions
  - Load, Store, Evict
- Bus traffic
  - BusRd, BusWr
[Figure: state diagram, edges labeled event/bus action]
  Valid   -> Valid:   Load/--, Store/BusWr   (local actions)
  Valid   -> Invalid: Evict/--               (local action)
  Valid   -> Invalid: BusWr/--               (bus traffic)
  Invalid -> Valid:   Load/BusRd             (local action)
  Invalid -> Invalid: Store/BusWr            (local action)
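The valid/invalid state machine above can be sketched in C as a single transition function. This is a minimal illustration, assuming one cache line per node; the names `Line`, `Event`, `BusMsg`, and `vi_step` are hypothetical and do not come from the lecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
typedef enum { EV_LOAD, EV_STORE, EV_EVICT, EV_SNOOP_BUSRD, EV_SNOOP_BUSWR } Event;
typedef enum { BUS_NONE, BUS_RD, BUS_WR } BusMsg;

typedef struct { bool valid; } Line;   /* the one-bit valid flag */

/* Apply one event to a line and return the bus message this cache issues. */
BusMsg vi_step(Line *line, Event ev) {
    switch (ev) {
    case EV_LOAD:
        if (line->valid) return BUS_NONE;   /* Valid: Load/-- (hit) */
        line->valid = true;                 /* Invalid: Load/BusRd */
        return BUS_RD;
    case EV_STORE:
        return BUS_WR;                      /* write-through, no-allocate: state unchanged */
    case EV_EVICT:
        line->valid = false;                /* Evict/-- */
        return BUS_NONE;
    case EV_SNOOP_BUSWR:
        line->valid = false;                /* another node wrote: drop the replica */
        return BUS_NONE;
    case EV_SNOOP_BUSRD:
        return BUS_NONE;                    /* reads by others do not affect this copy */
    }
    assert(0 && "unknown event");
    return BUS_NONE;
}
```

Because every store goes to the bus regardless of state, the sketch also makes the cost of the write-through policy visible.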

Shared Memory Systems
- Multiple threads employ a shared memory system
  - Easy for programmers
- Complex synchronization mechanisms are required
  - Cache coherence
    - All the processors see the same data for a particular memory address, as they would if there were no caches in the system
    - e.g., snoopy protocol with write-through, write no-allocate
    - Inefficient
  - Memory consistency
    - All memory instructions appear to execute in the program order
    - e.g., sequential consistency

Snooping with Write-Back Policy
- Problem: writes are not propagated to memory until eviction
  - Cached data may be different from main memory
- Solution: identify the owner of the most recently updated replica
  - Every data block may have only one owner at any time
  - Only the owner can update the replica
  - Multiple readers can share the data
    - No one can write without gaining ownership first

Modified-Shared-Invalid Protocol
- Every cache block transitions among three states
  - Invalid: no replica in the cache
  - Shared: a read-only copy in the cache
    - Multiple units may have the same copy
  - Modified: a writable copy of the data in the cache
    - The replica has been updated
    - The cache has the only valid copy of the data block
- Processor actions
  - Load, Store, Evict
- Bus messages
  - BusRd, BusRdX, BusInv, BusWB, BusReply

MSI Example
The example builds the MSI state diagram step by step while P1 and P2 access the same block. States are listed as (P1, P2).

  Step  Action    Before   Bus messages        After
  1     P1 Load   (I, I)   BusRd, BusReply     (S, I)
  2     P2 Load   (S, I)   BusRd, [BusReply]   (S, S)
  3     P2 Evict  (S, S)   none                (S, I)
  4     P2 Store  (S, I)   BusRdX, [BusReply]  (I, M)
  5     P1 Load   (I, M)   BusRd, BusReply     (S, S)
  6     P1 Store  (S, S)   BusInv              (M, I)
  7     P2 Store  (M, I)   BusRdX, BusReply    (I, M)
  8     P2 Evict  (I, M)   BusWB               (I, I)

[Figure: completed state diagram, edges labeled event/bus action; brackets mark an optional reply]
  invalid  -> shared:   Load/BusRd
  invalid  -> modified: Store/BusRdX
  shared   -> shared:   Load/--, BusRd/[BusReply]
  shared   -> invalid:  Evict/--, BusInv,BusRdX/[BusReply]
  shared   -> modified: Store/BusInv
  modified -> modified: Load, Store/--
  modified -> shared:   BusRd/BusReply
  modified -> invalid:  BusRdX/BusReply, Evict (BusWB on the bus)
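The MSI transitions from the example can be sketched in C as two functions, one for local processor actions and one for snooped bus messages. This is an illustrative sketch for a single block, not the lecture's implementation; the names `msi_local`, `msi_snoop`, and `msi_access` are hypothetical, and the sketch assumes that memory supplies the data whenever a cache in the shared state declines the optional reply.

```c
#include <assert.h>

/* Hypothetical names for illustration only. */
typedef enum { ST_I, ST_S, ST_M } State;
typedef enum { OP_LOAD, OP_STORE, OP_EVICT } Op;
typedef enum { MSG_NONE, MSG_BUSRD, MSG_BUSRDX, MSG_BUSINV, MSG_BUSWB, MSG_BUSREPLY } Msg;

/* Local processor action: update the state, return the bus message issued. */
Msg msi_local(State *s, Op op) {
    switch (*s) {
    case ST_I:
        if (op == OP_LOAD)  { *s = ST_S; return MSG_BUSRD; }
        if (op == OP_STORE) { *s = ST_M; return MSG_BUSRDX; }
        return MSG_NONE;                    /* nothing to evict */
    case ST_S:
        if (op == OP_STORE) { *s = ST_M; return MSG_BUSINV; }
        if (op == OP_EVICT) { *s = ST_I; }  /* clean copy: silent evict */
        return MSG_NONE;                    /* load hit or silent evict */
    case ST_M:
        if (op == OP_EVICT) { *s = ST_I; return MSG_BUSWB; }
        return MSG_NONE;                    /* load/store hit */
    }
    assert(0 && "unknown state");
    return MSG_NONE;
}

/* Snooped bus message: update the state, return this cache's reply (if any). */
Msg msi_snoop(State *s, Msg seen) {
    switch (*s) {
    case ST_I:
        return MSG_NONE;
    case ST_S:
        if (seen == MSG_BUSRDX || seen == MSG_BUSINV) *s = ST_I;
        return MSG_NONE;                    /* optional reply left to memory (assumption) */
    case ST_M:
        if (seen == MSG_BUSRD)  { *s = ST_S; return MSG_BUSREPLY; }
        if (seen == MSG_BUSRDX) { *s = ST_I; return MSG_BUSREPLY; }
        return MSG_NONE;
    }
    assert(0 && "unknown state");
    return MSG_NONE;
}

/* Cache `who` performs `op`; every other cache snoops the resulting message. */
Msg msi_access(State *caches, int n, int who, Op op) {
    Msg m = msi_local(&caches[who], op);
    if (m != MSG_NONE) {
        for (int i = 0; i < n; ++i)
            if (i != who) msi_snoop(&caches[i], m);
    }
    return m;
}
```

Calling `msi_access` with the eight actions of the example, starting from two invalid caches, reproduces the state sequence listed in the walkthrough.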

Modified, Exclusive, Shared, Invalid
- Also known as the Illinois protocol
  - Employed by real processors
  - A cache may have an exclusive copy of the data
  - The exclusive copy may be copied between caches
- Pros
  - No invalidation traffic on write hits in the E state
  - Lower overheads in sequential applications
- Cons
  - More complex protocol
  - Longer memory latency due to the protocol
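The benefit of the E state can be sketched in C with two small helpers. This is a simplified illustration that covers only load fills and stores, assuming the requester can learn whether any other cache holds the block; `mesi_fill_on_load` and `mesi_store` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } MesiState;

/* State after a load miss: Exclusive if no other cache holds the block. */
MesiState mesi_fill_on_load(bool other_sharers) {
    return other_sharers ? MESI_S : MESI_E;
}

/* State after a store; *needs_bus reports whether a bus transaction
 * (invalidation or ownership request) is required first. */
MesiState mesi_store(MesiState s, bool *needs_bus) {
    assert(needs_bus != 0);
    /* E and M already hold the only copy, so the write stays local. */
    *needs_bus = (s == MESI_I || s == MESI_S);
    return MESI_M;
}
```

A thread that loads private data and then writes it goes I to E to M with a single bus transaction, which is where the lower overhead for sequential applications comes from.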

Alternatives to Snoopy Protocols
- Problem: snooping-based protocols are not scalable
  - Shared bus bandwidth is limited
  - Every node broadcasts messages and monitors the bus
- Solution: limit the traffic using directory structures
  - Home directory keeps track of the sharers of each block
[Figure: four nodes, each with a Core, a Cache, and a Directory, connected by an Interconnection Network]
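One common way to track sharers is a bit vector per block, sketched below in C. This is an assumption-laden illustration rather than the lecture's design: `NUM_NODES`, `DirEntry`, `dir_read`, and `dir_write` are hypothetical, and the sketch models only the directory bookkeeping, not the messages themselves.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_NODES 4   /* assumed system size */

/* Hypothetical directory entry for one block. */
typedef struct {
    uint32_t sharers;  /* bit i set: node i holds a read-only copy */
    int owner;         /* node holding a modified copy, or -1 */
} DirEntry;

/* Node reads the block. Returns the node that must supply the data,
 * or -1 if memory supplies it. */
int dir_read(DirEntry *e, int node) {
    assert(node >= 0 && node < NUM_NODES);
    int supplier = e->owner;
    if (e->owner >= 0) {                 /* owner downgrades to a sharer */
        e->sharers |= 1u << e->owner;
        e->owner = -1;
    }
    e->sharers |= 1u << node;
    return supplier;
}

/* Node writes the block. Returns the bitmask of nodes that must be
 * invalidated; only those nodes receive a message. */
uint32_t dir_write(DirEntry *e, int node) {
    assert(node >= 0 && node < NUM_NODES);
    uint32_t inval = e->sharers & ~(1u << node);
    if (e->owner >= 0 && e->owner != node) inval |= 1u << e->owner;
    e->sharers = 0;
    e->owner = node;
    return inval;
}
```

The returned mask is the point of the directory: invalidations go only to nodes that actually hold a copy instead of being broadcast to everyone.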

Memory Consistency Model
- Memory operations are reordered to improve performance
- A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed with respect to one another
- What is the expected output of this application?

  Initially A = flag = 0

  P1:                 P2:
    A = 1;              while (flag == 0);
    flag = 1;           printf("%d", A);

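Memory Consistency
- Recall: load-store queue architecture
  - Check availability of operands
  - Compute the effective address
  - Send the request to memory if there are no memory hazards

  Initially A = flag = 0

  P1:                      P2:
    A = 1;      (2)          while (flag == 0);
    flag = 1;   (1)          printf("%d", A);

  [Figure annotations: (1) and (2) appear to mark the order in which P1's stores are performed. If flag = 1 is performed before A = 1, P2 can print 0 instead of 1.]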
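In C11 the flag example can be written so that the output is guaranteed to be 1. The sketch below uses release/acquire atomics from `<stdatomic.h>` rather than a hardware fence instruction, which is a substitution made for illustration; `producer` and `consumer` are hypothetical names for the P1 and P2 code.

```c
#include <assert.h>
#include <stdatomic.h>

int A = 0;             /* ordinary shared data */
atomic_int flag = 0;   /* synchronization variable */

/* P1: publish A, then raise the flag. */
void producer(void) {
    A = 1;
    /* Release: the write to A cannot be reordered after this store. */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* P2: wait for the flag, then read A. */
int consumer(void) {
    /* Acquire: the read of A cannot be reordered before this load. */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
        /* spin */
    }
    int value = A;
    assert(value == 1);   /* holds once the acquire load has seen flag == 1 */
    return value;
}
```

With plain `int` variables and no ordering constraints, neither the compiler nor the hardware is obliged to keep the two stores in program order, which is the failure the slide illustrates.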

Dekker's Algorithm Example
- Critical region with mutually exclusive access
  - At any time, at most one process is allowed to be in the region
- Reordering in the load-store queue may result in failure

  Initially A = B = 0

  P1:                        P2:
    LOCK_A: A = 1;     (2)     LOCK_B: B = 1;     (2)
    if (B != 0) {      (1)     if (A != 0) {      (1)
      A = 0;                     B = 0;
      goto LOCK_A;               goto LOCK_B;
    }                          }
    // ...                     // ...
    A = 0;                     B = 0;

  [Figure annotations: (1) and (2) mark a reordering in which each load is performed before the preceding store, so both processes can enter the region.]

Sequential Consistency
- 1. Within a program, program order is preserved
- 2. Each instruction executes atomically
- 3. Instructions from different threads can be interleaved arbitrarily
- Bad performance!
[Figure: processors P1, P2, ..., Pn take turns accessing a single Memory. P1 issues a, b, c, d, e and P2 issues A, B, C, D, E. Example interleavings:]
  1. abAcBCDdeE
  2. aAbBcCdDeE
  3. ABCDEabcde
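The first rule can be checked mechanically for traces written in the figure's notation. The C sketch below assumes lowercase letters are P1's instructions and uppercase letters are P2's, each issued in alphabetical order; `preserves_program_order` is a hypothetical helper.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Returns true if the trace keeps each processor's instructions in
 * program (alphabetical) order. Lowercase = P1, uppercase = P2. */
bool preserves_program_order(const char *trace) {
    assert(trace != 0);
    int last_p1 = 0, last_p2 = 0;
    for (; *trace != '\0'; ++trace) {
        unsigned char c = (unsigned char)*trace;
        if (islower(c)) {
            if (c <= last_p1) return false;   /* P1 instruction out of order */
            last_p1 = c;
        } else if (isupper(c)) {
            if (c <= last_p2) return false;   /* P2 instruction out of order */
            last_p2 = c;
        } else {
            return false;                     /* not an instruction label */
        }
    }
    return true;
}
```

All three interleavings in the figure pass this check, while a trace such as `baAB` fails it because P1's `b` appears before `a`.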

Relaxed Consistency Model
- Real processors do not implement sequential consistency
  - Not all instructions need to be executed in program order
  - e.g., a read can bypass earlier writes
- A fence instruction can be used to enforce ordering among memory instructions
  - e.g., Dekker's algorithm with fence

  P1:                    P2:
    LOCK_A: A = 1;         LOCK_B: B = 1;
    fence;                 fence;
    if (B != 0) {          if (A != 0) {
      A = 0;                 B = 0;
      goto LOCK_A;           goto LOCK_B;
    }                      }
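The fenced version of the algorithm can be sketched with C11 atomics, using `atomic_thread_fence(memory_order_seq_cst)` in place of the slide's `fence`. The names `want`, `try_enter`, and `leave` are hypothetical; `want[0]` plays the role of A and `want[1]` the role of B, and a failed attempt returns `false` where the slide uses `goto`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

atomic_int want[2];   /* want[0] ~ A, want[1] ~ B; zero-initialized */

/* One attempt to enter the critical region; the caller retries on false. */
bool try_enter(int me) {
    assert(me == 0 || me == 1);
    int other = 1 - me;
    atomic_store_explicit(&want[me], 1, memory_order_relaxed);      /* A = 1 */
    /* Full fence: the store above is ordered before the load below. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&want[other], memory_order_acquire) != 0) {
        atomic_store_explicit(&want[me], 0, memory_order_release);  /* back off */
        return false;
    }
    return true;   /* inside the critical region */
}

/* Leave the critical region. */
void leave(int me) {
    assert(me == 0 || me == 1);
    atomic_store_explicit(&want[me], 0, memory_order_release);      /* A = 0 */
}
```

A caller would loop with `while (!try_enter(me)) { }`. As in the slide's version, both processes can back off repeatedly, so the sketch provides mutual exclusion but not guaranteed progress.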

Fence Example
P1 and P2 each execute the same sequence:

  { Region of code with no races }
  Fence
  Acquire_lock
  Fence
  { Racy code }
  Fence
  Release_lock
  Fence
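The fence placement above can be mirrored in C11 with a simple spinlock. This is a sketch under assumptions: the lock is an `atomic_flag`, every `Fence` is written as a sequentially consistent `atomic_thread_fence`, and `shared_counter` and `locked_increment` are hypothetical stand-ins for the racy code.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;
int shared_counter = 0;   /* stands in for the racy data */

void acquire_lock(void) {
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_relaxed)) {
        /* spin */
    }
}

void release_lock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_relaxed);
}

void locked_increment(void) {
    atomic_thread_fence(memory_order_seq_cst);   /* Fence */
    acquire_lock();
    atomic_thread_fence(memory_order_seq_cst);   /* Fence */
    shared_counter++;                            /* racy code, now protected */
    assert(shared_counter > 0);
    atomic_thread_fence(memory_order_seq_cst);   /* Fence */
    release_lock();
    atomic_thread_fence(memory_order_seq_cst);   /* Fence */
}
```

The fences after `acquire_lock` and before `release_lock` are the ones that keep the protected accesses inside the locked region; in the C11 model, acquire and release fences would be enough at those two points.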
