PARALLEL MEMORY ARCHITECTURE
Mahdi Nazm Bojnordi
Assistant Professor, School of Computing, University of Utah
CS/ECE 6810: Computer Architecture
Overview
- Announcement
  - Homework 6 is due tonight
    - The last one!
- This lecture
  - Communication in multiprocessors
  - Parallel memory architecture
  - Cache coherence protocol
Example Code I
- A sequential application runs as a single thread

Kernel Function:

void kern(int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        A[i] = A[i] * A[i] + 5;
    }
}

Single Thread:  main() { … kern(1, n); … }

[Figure: a single processor operating on array A[1..n] in memory]
Example Code I
- Two threads operating on separate partitions

Kernel Function:

void kern(int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        A[i] = A[i] * A[i] + 5;
    }
}

Thread 0:  main() { … kern(1, n/2); … }
Thread 1:  kern(n/2 + 1, n);

[Figure: two processors, each operating on its own half of array A[1..n] in memory]
Performance of Parallel Processing
- Recall: Amdahl's law for theoretical speedup
  - Overall speedup is limited by the fraction of the program that cannot be executed in parallel

    speedup = 1 / (f + (1 - f) / p)

    f: sequential fraction, p: number of processors

[Figure: Speedup vs. Sequential Fraction: speedup plotted against the number of processors (up to 150) for sequential fractions of 10%, 20%, 40%, 60%, and 90%; the curves saturate at roughly 10x, 5x, ~2x, and ~1x as the sequential fraction grows]
Example Code II
- A single location is updated every time

Kernel Function:

void kern(int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        sum = sum * A[i];
    }
}

Thread 0:  main() { … kern(1, n); … }

[Figure: a single processor; memory holds array A[1..n] and the shared variable sum]
Example Code II
- Two threads operating on separate partitions

Kernel Function:

void kern(int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        sum = sum * A[i];
    }
}

Thread 0:  main() { … kern(1, n/2); … }
Thread 1:  kern(n/2 + 1, n);

[Figure: two processors operating on separate halves of A[1..n], but both updating the single shared variable sum]
Communication in Multiprocessors
- How do multiple processor cores communicate?

Shared Memory
  - Multiple threads employ shared memory
  - Easy for programmers (loads and stores)

Message Passing
  - Explicit communication through an interconnection network
  - Simple hardware

[Figure: left, cores 1..N attached to a single shared memory; right, cores 1..N each with a private memory, connected by an interconnection network]
Shared Memory Architectures

Uniform Memory Access (UMA)
  - Equal latency for all processors
  - Simple software control

Non-Uniform Memory Access (NUMA)
  - Access latency is proportional to proximity
    - Fast local accesses

[Figure: example UMA, cores 1..4 sharing one memory; example NUMA, each core paired with a local memory and a router]
Network Topologies

Shared Network
  - Low latency
  - Low bandwidth
  - Simple control
    - e.g., bus

Point-to-Point Network
  - High latency
  - High bandwidth
  - Complex control
    - e.g., mesh, ring

[Figure: left, core/memory/router tiles attached to one shared medium; right, four core/memory/router tiles connected point to point]
Challenges in Shared Memories
- Correctness of an application is influenced by
  - Memory consistency
    - All memory instructions appear to execute in program order
    - Known to the programmer
  - Cache coherence
    - All processors see the same data for a particular memory address, as they would if there were no caches in the system
    - Invisible to the programmer
Cache Coherence Problem
- Multiple copies of each cache block
  - In main memory and in caches
- Multiple copies can get inconsistent when writes happen
  - Solution: propagate writes from one core to the others

[Figure: cores 1..N, each with a private cache, above a shared main memory]
Scenario 1: Loading From Memory
- Variable A initially has value 0
- P1 stores value 1 into A
- P2 loads A from memory and sees the old value 0

[Figure: P1 and P2, each with a private cache, on a memory bus; memory holds A:0]
Scenario 2: Loading From Cache
- P1 and P2 both have variable A (value 0) in their caches
- P1 stores value 1 into A
- P2 loads A from its own cache and sees the old value 0

[Figure: P1 and P2, each with a private cache, on a memory bus; memory holds A:0]
Cache Coherence
- The key operation is an update/invalidate message sent to all or a subset of the cores
  - Software-based management
    - Flush: write all of the dirty blocks to memory
    - Invalidate: make all of the cache blocks invalid
  - Hardware-based management
    - Update or invalidate the other copies on every write
    - Send data to everyone, or only to the cores that have a copy
- An invalidation-based protocol is better. Why?
Snoopy Protocol
- Relies on a broadcast infrastructure among caches
  - For example, a shared bus
- Every cache monitors (snoops) the traffic on the shared medium to keep the state of its cache blocks up to date

[Figure: multicore chips in which each core has a private L1 cache, sharing an LLC and memory over a common bus]
Simple Snooping Protocol
- Relies on a write-through, write-no-allocate cache
- Multiple readers are allowed
  - Writes invalidate replicas
- Employs a simple state machine for each cache unit

[Figure: P1 and P2, each with a private cache, on a memory bus; memory holds A:0]
Simple Snooping State Machine
- Every node updates its one-bit valid flag using a simple finite state machine (FSM)
- Processor actions
  - Load, Store, Evict
- Bus traffic
  - BusRd, BusWr

States: Valid, Invalid (transitions labeled action / bus message issued)

  Invalid: Load  / BusRd -> Valid
  Invalid: Store / BusWr -> Invalid    (write no-allocate)
  Valid:   Load  / --    -> Valid
  Valid:   Store / BusWr -> Valid      (write-through)
  Valid:   Evict / --    -> Invalid
  Valid:   snooped BusWr / -- -> Invalid   (transition triggered by bus traffic, not a local action)