
PARALLEL MEMORY ARCHITECTURE

Mahdi Nazm Bojnordi, Assistant Professor, School of Computing, University of Utah. CS/ECE 6810: Computer Architecture.


  1. PARALLEL MEMORY ARCHITECTURE
     Mahdi Nazm Bojnordi, Assistant Professor
     School of Computing, University of Utah
     CS/ECE 6810: Computer Architecture

  2. Overview
     - Announcement
       - Homework 6 is due tonight
         - The last one!
     - This lecture
       - Communication in multiprocessors
       - Parallel memory architecture
       - Cache coherence protocol

  3. Example Code I
     - A sequential application runs as a single thread

     Kernel function:

         void kern(int start, int end) {
             int i;
             for (i = start; i <= end; ++i) {
                 A[i] = A[i] * A[i] + 5;
             }
         }

     Single thread:

         main() {
             …
             kern(1, n);
             …
         }

     [Figure: one processor running a single thread over array A (elements 1 … n) in memory]

  4. Example Code I
     - Two threads operating on separate partitions

     Kernel function:

         void kern(int start, int end) {
             int i;
             for (i = start; i <= end; ++i) {
                 A[i] = A[i] * A[i] + 5;
             }
         }

     Two threads, one call each:

         main() {
             …
             kern(1, n/2);
             …
             kern(n/2+1, n);
         }

     [Figure: two processors running Thread 0 and Thread 1, each working on one half of array A (elements 1 … n) in memory]

  5. Performance of Parallel Processing
     - Recall: Amdahl's law for theoretical speedup
       - Overall speedup is limited by the fraction of the program that can be executed in parallel

     $$\text{speedup} = \frac{1}{f + \frac{1 - f}{N}}$$

     where f is the sequential fraction and N is the number of processors.

     [Figure: "Speedup vs. Sequential Fraction"; x-axis: Number of Processors (0 to 150); y-axis: Speedup (0 to 10); one curve per sequential fraction (10%, 20%, 40%, 60%, 90%); annotations on the curves: 10x, 5x, ~2x, ~1x]
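
Amdahl's law, speedup = 1 / (f + (1 - f) / N) for sequential fraction f on N processors, is easy to evaluate numerically. The sketch below is only an illustration: the function names `amdahl_speedup`, `amdahl_limit`, and `print_amdahl_table` are hypothetical and do not come from the lecture.

```c
#include <stdio.h>

/* Amdahl's law: f is the sequential fraction (0..1), n the processor count. */
double amdahl_speedup(double f, int n) {
    return 1.0 / (f + (1.0 - f) / (double)n);
}

/* Bound approached as the processor count grows without limit: 1/f. */
double amdahl_limit(double f) {
    return 1.0 / f;
}

/* Print the speedup on n processors for the sequential fractions
   plotted on the slide (10%, 20%, 40%, 60%, 90%). */
void print_amdahl_table(int n) {
    const double fractions[] = {0.10, 0.20, 0.40, 0.60, 0.90};
    for (int i = 0; i < 5; ++i) {
        printf("f = %2.0f%%: speedup on %d processors = %.2f (limit %.2f)\n",
               fractions[i] * 100.0, n,
               amdahl_speedup(fractions[i], n),
               amdahl_limit(fractions[i]));
    }
}
```

Calling `print_amdahl_table(150)` shows that even with a 10% sequential fraction the speedup stays below 10x, because the 1/f bound does not depend on the processor count.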

  6. Example Code II
     - A single location is updated every time

     Kernel function:

         void kern(int start, int end) {
             int i;
             for (i = start; i <= end; ++i) {
                 sum = sum * A[i];
             }
         }

     Thread 0:

         main() {
             …
             kern(1, n);
             …
         }

     [Figure: one processor running Thread 0; array A (elements 1 … n) and the variable sum in memory]

  7. Example Code II
     - Two threads operating on separate partitions

     Kernel function:

         void kern(int start, int end) {
             int i;
             for (i = start; i <= end; ++i) {
                 sum = sum * A[i];
             }
         }

     Two threads, one call each:

         main() {
             …
             kern(1, n/2);
             …
             kern(n/2+1, n);
         }

     [Figure: two processors running Thread 0 and Thread 1; array A (elements 1 … n) and the shared variable sum in memory]
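
Unlike Example Code I, both threads here update the same location, `sum`, so unsynchronized updates can be lost. The slide does not prescribe a fix. What follows is one possible sketch using POSIX threads, in which each thread accumulates a private partial result and combines it with the shared `sum` once, under a mutex. The names `kern_thread`, `struct range`, and `parallel_product` are hypothetical, and the array is indexed from 0 rather than from 1 as on the slide. Compile with `-pthread`.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared state, mirroring the slide: one array and one accumulator. */
static const int *A;                 /* input array (0-based in this sketch) */
static long sum;                     /* the single location both threads update */
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

struct range { int start; int end; };   /* inclusive bounds (hypothetical helper) */

/* Thread body: the same loop as kern(), but it accumulates into a private
   variable and touches the shared `sum` only once, while holding the lock. */
static void *kern_thread(void *arg) {
    const struct range *r = arg;
    long local = 1;
    for (int i = r->start; i <= r->end; ++i) {
        local = local * A[i];
    }
    pthread_mutex_lock(&sum_lock);
    sum = sum * local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

/* Multiply all n elements of `data` using two threads, one per half.
   Returns 0 on success and stores the product in *out; returns -1 if a
   thread could not be created. */
int parallel_product(const int *data, int n, long *out) {
    struct range halves[2] = { {0, n / 2 - 1}, {n / 2, n - 1} };
    pthread_t threads[2];

    A = data;
    sum = 1;
    for (int t = 0; t < 2; ++t) {
        if (pthread_create(&threads[t], NULL, kern_thread, &halves[t]) != 0) {
            for (int j = 0; j < t; ++j) {
                pthread_join(threads[j], NULL);   /* wait for started threads */
            }
            return -1;
        }
    }
    for (int t = 0; t < 2; ++t) {
        pthread_join(threads[t], NULL);
    }
    *out = sum;
    return 0;
}
```

Keeping the partial result private means the threads communicate through shared memory only once each, instead of on every loop iteration.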

  8. Communication in Multiprocessors
     - How do multiple processor cores communicate?

     Shared Memory
       - Multiple threads employ shared memory
       - Easy for programmers (loads and stores)

     Message Passing
       - Explicit communication through an interconnection network
       - Simple hardware

     [Figure, left: Core 1 … Core N attached to one Shared Memory. Right: Core 1 … Core N, each with its own Mem, connected by an Interconnection Network]

  9. Shared Memory Architectures

     Uniform Memory Access (UMA)
       - Equal latency for all processors
       - Simple software control

     Non-Uniform Memory Access (NUMA)
       - Access latency is proportional to proximity
         - Fast local accesses

     [Figure, Example UMA: Core 1 … Core 4 sharing one Memory. Example NUMA: Core 1 … Core 4, each with its own Mem and Router]

  10. Network Topologies

      Shared Network
        - Low latency
        - Low bandwidth
        - Simple control
          - e.g., bus

      Point to Point Network
        - High latency
        - High bandwidth
        - Complex control
          - e.g., mesh, ring

      [Figure, left: Core 1 … Core 4 and Mem modules on a shared network. Right: four nodes (Core, Mem, Router) numbered 1 to 4, linked router to router]

  11. Challenges in Shared Memories
      - Correctness of an application is influenced by
        - Memory consistency
          - All memory instructions appear to execute in program order
          - Known to the programmer
        - Cache coherence
          - All processors see the same data for a particular memory address, as they would if there were no caches in the system
          - Invisible to the programmer

  12. Cache Coherence Problem
      - Multiple copies of each cache block
        - In main memory and caches
      - Multiple copies can become inconsistent when writes happen
        - Solution: propagate writes from one core to the others

      [Figure: Core 1 … Core N, each with its own cache (Cache 1 … Cache N), above Main Memory]

  13. Scenario 1: Loading From Memory
      - Variable A initially has value 0
      - P1 stores value 1 into A
      - P2 loads A from memory and sees old value 0

      [Figure: P1 and P2, each with a cache, connected by a bus to memory holding A:0]

  14. Scenario 2: Loading From Cache
      - P1 and P2 both have variable A (value 0) in their caches
      - P1 stores value 1 into A
      - P2 loads A from its cache and sees old value

      [Figure: P1 and P2, each with a cache, connected by a bus to memory holding A:0]
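
Both scenarios can be reproduced with a toy model of two private caches that take no coherence actions. The sketch below assumes write-back, write-allocate caches (the slides do not state the write policy for these scenarios) and tracks a single variable A. All names (`struct cache_line`, `load_A`, `store_A`, `scenario1`, `scenario2`) are hypothetical.

```c
#include <stdbool.h>

/* One private, write-back cache's copy of variable A. */
struct cache_line {
    bool valid;   /* does this cache hold a copy of A? */
    bool dirty;   /* was the copy modified without updating memory? */
    int  value;
};

static int memory_A;   /* the copy of A held in main memory */

/* Load: hit in the private cache if possible, otherwise fetch from memory. */
int load_A(struct cache_line *c) {
    if (!c->valid) {
        c->value = memory_A;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* Store: update only the private copy. Memory and the other cache are
   not told about the write (write-back, no coherence). */
void store_A(struct cache_line *c, int v) {
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* Scenario 1: P1 stores 1 into A, then P2 loads A from memory.
   Returns the value P2 observes. */
int scenario1(void) {
    struct cache_line p1 = {0}, p2 = {0};
    memory_A = 0;
    store_A(&p1, 1);
    return load_A(&p2);   /* memory still holds the old value */
}

/* Scenario 2: both caches hold A, P1 stores 1, P2 loads from its cache.
   Returns the value P2 observes. */
int scenario2(void) {
    struct cache_line p1 = {0}, p2 = {0};
    memory_A = 0;
    load_A(&p1);
    load_A(&p2);
    store_A(&p1, 1);
    return load_A(&p2);   /* P2 hits on its own stale copy */
}
```

In this model both functions return the old value 0 even though P1 has already written 1, which is the inconsistency a coherence protocol has to remove.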

  15. Cache Coherence
      - The key operation is an update or invalidate sent to all or a subset of the cores
        - Software-based management
          - Flush: write all of the dirty blocks to memory
          - Invalidate: make all of the cache blocks invalid
        - Hardware-based management
          - Update or invalidate other copies on every write
          - Send data to everyone, or only to the ones that have a copy
      - Invalidation-based protocol is better. Why?

  16. Snoopy Protocol
      - Relies on a broadcast infrastructure among caches
        - For example, a shared bus
      - Every cache monitors (snoops) the traffic on the shared medium to keep the states of its cache blocks up to date

      [Figure: cores with private L1 caches, LLC, and Memory]

  17. Simple Snooping Protocol
      - Relies on a write-through, write no-allocate cache
      - Multiple readers are allowed
        - Writes invalidate replicas
      - Employs a simple state machine for each cache unit

      [Figure: P1 and P2, each with a cache, connected by a bus to memory holding A:0]
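
As a rough illustration of these rules, the sketch below models two caches that write through to memory, do not allocate on a write miss, and invalidate their replica whenever they snoop another processor's write. It tracks a single variable A, and every name in it (`snoop_load`, `snoop_store`, `reset_system`, `NUM_CACHES`) is hypothetical rather than taken from the lecture.

```c
#include <stdbool.h>

#define NUM_CACHES 2   /* P1 and P2, as in the figure */

/* Each cache keeps a one-bit valid flag and its copy of variable A. */
struct snoop_line { bool valid; int value; };

static struct snoop_line caches[NUM_CACHES];
static int memory_A;   /* the copy of A held in main memory */

/* Put A = initial in memory and empty every cache. */
void reset_system(int initial) {
    memory_A = initial;
    for (int i = 0; i < NUM_CACHES; ++i) {
        caches[i].valid = false;
    }
}

/* Load by processor p: hit if the line is valid, otherwise read memory
   over the bus (BusRd) and keep a copy. Multiple readers are allowed. */
int snoop_load(int p) {
    if (!caches[p].valid) {
        caches[p].value = memory_A;
        caches[p].valid = true;
    }
    return caches[p].value;
}

/* Store by processor p: always write through to memory (BusWr).
   No-allocate: the local copy is updated only if it is already present.
   Every other cache snoops the write and invalidates its replica. */
void snoop_store(int p, int v) {
    memory_A = v;
    if (caches[p].valid) {
        caches[p].value = v;
    }
    for (int i = 0; i < NUM_CACHES; ++i) {
        if (i != p) {
            caches[i].valid = false;
        }
    }
}
```

If both processors load A, P1 then stores 1, and P2 loads again, P2's replica has been invalidated, so its load misses and fetches the new value from memory.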

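  18. Simple Snooping State Machine
      - Every node updates its one-bit valid flag using a simple finite state machine (FSM)
      - Processor actions
        - Load, Store, Evict
      - Bus traffic
        - BusRd, BusWr

      [Figure: state diagram with two states, Valid and Invalid. Edge labels read "event/bus action"; the legend distinguishes transactions by local actions from transactions by bus traffic.]
        - Valid to Valid: Load/--, Store/BusWr
        - Valid to Invalid: Evict/--, BusWr/--
        - Invalid to Valid: Load/BusRd
        - Invalid to Invalid: Store/BusWr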
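
The two-state machine can be written as a small transition function. The sketch below encodes one reading of the diagram's edge labels for a write-through, write no-allocate cache: a load in Invalid issues BusRd and moves to Valid, a store always issues BusWr without changing state, and an eviction or a snooped BusWr moves to Invalid. The type and function names (`enum line_state`, `enum event`, `struct transition`, `step`) are hypothetical.

```c
/* The one-bit valid flag, expressed as a state. */
enum line_state { INVALID, VALID };

/* Local processor actions and snooped bus traffic. */
enum event { LOAD, STORE, EVICT, SNOOP_BUS_RD, SNOOP_BUS_WR };

/* Bus transaction issued in response to a local action, if any. */
enum bus_action { BUS_NONE, BUS_RD, BUS_WR };

struct transition {
    enum line_state next;
    enum bus_action action;
};

/* Compute the next state and the bus action for one event. */
struct transition step(enum line_state s, enum event e) {
    struct transition t = { s, BUS_NONE };   /* default: stay, no traffic */
    switch (e) {
    case LOAD:
        if (s == INVALID) {                  /* Load/BusRd: miss, fetch a copy */
            t.next = VALID;
            t.action = BUS_RD;
        }                                    /* Load/--: hit in Valid */
        break;
    case STORE:
        t.action = BUS_WR;                   /* Store/BusWr: write-through, no allocate */
        break;
    case EVICT:
        t.next = INVALID;                    /* Evict/-- */
        break;
    case SNOOP_BUS_WR:
        t.next = INVALID;                    /* BusWr/--: another node wrote */
        break;
    case SNOOP_BUS_RD:
        break;                               /* readers may share the block */
    }
    return t;
}
```

A store in the Invalid state leaves the line Invalid because the cache is write no-allocate: the data goes to memory but no local copy is created.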
