PARALLEL MEMORY ARCHITECTURE - Mahdi Nazm Bojnordi - PowerPoint PPT Presentation



SLIDE 1

PARALLEL MEMORY ARCHITECTURE

CS/ECE 6810: Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor, School of Computing, University of Utah

SLIDE 2

Overview

- Announcement
  - Homework 6 is due tonight (the last one!)
- This lecture
  - Communication in multiprocessors
  - Parallel memory architecture
  - Cache coherence protocol

SLIDE 3

Example Code I

- A sequential application runs as a single thread

Kernel function:

    void kern(int start, int end) {
        int i;
        for (i = start; i <= end; ++i) {
            A[i] = A[i] * A[i] + 5;
        }
    }

Single thread:

    main() {
        …
        kern(1, n);
        …
    }

[Diagram: one processor runs the kernel over array A[1..n] in memory]

SLIDE 4

Example Code I

- Two threads operating on separate partitions

Kernel function:

    void kern(int start, int end) {
        int i;
        for (i = start; i <= end; ++i) {
            A[i] = A[i] * A[i] + 5;
        }
    }

Thread 0:

    main() {
        …
        kern(1, n/2);
        …
    }

Thread 1:

    kern(n/2+1, n);

[Diagram: two processors share memory; each thread sweeps its own half of A[1..n]]
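The two-thread partitioning sketched on this slide can be written with POSIX threads. This is an illustrative harness, not code from the lecture: the names `run_parallel`, `kern_thread`, and `struct range` are invented here, and the fixed size `N` stands in for the slide's `n`.

```c
#include <pthread.h>
#include <stddef.h>

enum { N = 8 };
static int A[N + 1];              /* the slides index A[1..n] */

/* hypothetical per-thread argument: the [start, end] partition */
struct range { int start, end; };

/* the slide's kernel, run by each thread on its own partition */
static void *kern_thread(void *arg) {
    struct range *r = (struct range *)arg;
    for (int i = r->start; i <= r->end; ++i)
        A[i] = A[i] * A[i] + 5;
    return NULL;
}

/* thread 0 takes A[1..n/2], thread 1 takes A[n/2+1..n] */
void run_parallel(int n) {
    pthread_t t0, t1;
    struct range r0 = { 1, n / 2 };
    struct range r1 = { n / 2 + 1, n };
    pthread_create(&t0, NULL, kern_thread, &r0);
    pthread_create(&t1, NULL, kern_thread, &r1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
}
```

No synchronization is needed inside the loop because the two partitions are disjoint; that property is what makes this example embarrassingly parallel, unlike Example Code II below.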

SLIDE 5

Performance of Parallel Processing

- Recall: Amdahl's law for theoretical speedup

  - Overall speedup is limited by the fraction of the program that cannot be executed in parallel

speedup = 1 / (f + (1 - f) / p)

where f is the sequential fraction and p is the number of processors.

[Plot: Speedup vs. Number of Processors (2 to 150) for sequential fractions f = 10%, 20%, 40%, 60%, 90%; each curve saturates near 1/f, from 10x down through 5x and ~2x to ~1x]
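Amdahl's law from this slide can be evaluated directly; the helper below is a minimal sketch (the function name is invented here) using the formula with f as the sequential fraction and p as the processor count.

```c
/* Amdahl's law: speedup = 1 / (f + (1 - f) / p)
   f: sequential fraction (0.0 to 1.0), p: number of processors */
double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}
```

For example, with f = 10% the speedup approaches but never exceeds 10x no matter how many processors are added, which is the saturation behavior shown in the plot.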

SLIDE 6

Example Code II

- A single memory location (sum) is updated on every iteration

Kernel function:

    void kern(int start, int end) {
        int i;
        for (i = start; i <= end; ++i) {
            sum = sum * A[i];
        }
    }

Single thread:

    main() {
        …
        kern(1, n);
        …
    }

[Diagram: one processor; array A[1..n] and the shared variable sum live in memory]

SLIDE 7

Example Code II

- Two threads operating on separate partitions

Kernel function:

    void kern(int start, int end) {
        int i;
        for (i = start; i <= end; ++i) {
            sum = sum * A[i];
        }
    }

Thread 0:

    main() {
        …
        kern(1, n/2);
        …
    }

Thread 1:

    kern(n/2+1, n);

[Diagram: two processors update the shared variable sum in memory concurrently]
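With two threads, `sum = sum * A[i]` is a racy read-modify-write on a shared location. One common fix, shown here as an illustrative sketch rather than the lecture's own code (the names and the POSIX-threads harness are assumptions), is to accumulate privately per thread and merge into the shared variable under a mutex.

```c
#include <pthread.h>
#include <stddef.h>

enum { N = 8 };
static int A[N + 1];
static long sum;                          /* shared accumulator (a product here) */
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

struct range { int start, end; };

static void *kern_thread(void *arg) {
    struct range *r = (struct range *)arg;
    long local = 1;                       /* private partial product: no race */
    for (int i = r->start; i <= r->end; ++i)
        local = local * A[i];
    pthread_mutex_lock(&sum_lock);        /* one short critical section per thread */
    sum = sum * local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

long run_parallel(int n) {
    pthread_t t0, t1;
    struct range r0 = { 1, n / 2 };
    struct range r1 = { n / 2 + 1, n };
    sum = 1;
    pthread_create(&t0, NULL, kern_thread, &r0);
    pthread_create(&t1, NULL, kern_thread, &r1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return sum;
}
```

Locking every iteration would serialize the loop; accumulating locally keeps the critical section to a single multiply per thread.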

SLIDE 8

Communication in Multiprocessors

- How do multiple processor cores communicate?

Shared memory:
- Multiple threads employ shared memory
- Easy for programmers (loads and stores)

Message passing:
- Explicit communication through an interconnection network
- Simple hardware

[Diagram: cores 1..N attached to one shared memory, vs. cores each with a private memory connected by an interconnection network]

SLIDE 9

Shared Memory Architectures

Uniform Memory Access (UMA):
- Equal latency for all processors
- Simple software control

Non-Uniform Memory Access (NUMA):
- Access latency is proportional to proximity
- Fast local accesses

[Diagram: example UMA with cores 1..4 sharing one memory; example NUMA with each core owning a local memory behind a router]

SLIDE 10

Network Topologies

Shared network (e.g., bus):
- Low latency
- Low bandwidth
- Simple control

Point-to-point network (e.g., mesh, ring):
- High latency
- High bandwidth
- Complex control

[Diagram: cores on one shared bus, vs. cores 1..4 connected point to point through per-core routers]

SLIDE 11

Challenges in Shared Memories

- Correctness of an application is influenced by
  - Memory consistency
    - All memory instructions appear to execute in program order
    - Known to the programmer
  - Cache coherence
    - All processors see the same data for a particular memory address, as they would if there were no caches in the system
    - Invisible to the programmer

SLIDE 12

Cache Coherence Problem

- Multiple copies of each cache block
  - In main memory and caches
- Multiple copies can get inconsistent when writes happen
  - Solution: propagate writes from one core to the others

[Diagram: cores 1..N, each with a private cache, above a shared main memory]

SLIDE 13

Scenario 1: Loading From Memory

- Variable A initially has value 0
- P1 stores value 1 into A
- P2 loads A from memory and sees the old value 0

[Diagram: P1 and P2 with private caches on a memory bus; A holds 0 in memory]

SLIDE 14

Scenario 2: Loading From Cache

- P1 and P2 both have variable A (value 0) in their caches
- P1 stores value 1 into A
- P2 loads A from its cache and sees the old value

[Diagram: P1 and P2 with private caches on a memory bus; A holds 0 in memory]
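The staleness in this scenario can be made concrete with a toy model of two incoherent caches. This is an illustrative sketch, not the lecture's code: `struct cache`, `load_a`, and `store_a` are invented names, and the point is only that a store updates one cache's copy while nothing touches the other's.

```c
/* toy model: each cache may hold a private copy of variable A */
struct cache { int has_a; int a; };

int load_a(struct cache *c, const int *mem_a) {
    if (c->has_a)
        return c->a;          /* hit: served from the cache, memory not consulted */
    c->a = *mem_a;            /* miss: fill the copy from memory */
    c->has_a = 1;
    return c->a;
}

void store_a(struct cache *c, int v) {
    c->a = v;                 /* write hits only this cache: no write-through */
    c->has_a = 1;             /* and no invalidation of the other copy */
}
```

After both caches load A = 0 and P1 stores 1, P2's next load still hits its own cache and returns the stale 0, exactly the incoherence the coherence protocols below are built to prevent.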

SLIDE 15

Cache Coherence

- The key operation is an update/invalidate message sent to all or a subset of the cores
  - Software-based management
    - Flush: write all of the dirty blocks to memory
    - Invalidate: make all of the cache blocks invalid
  - Hardware-based management
    - Update or invalidate other copies on every write
    - Send data to everyone, or only to the ones who have a copy
- An invalidation-based protocol is better. Why?

SLIDE 16

Snoopy Protocol

- Relies on a broadcast infrastructure among caches (for example, a shared bus)
- Every cache monitors (snoops) the traffic on the shared medium to keep the state of its cache blocks up to date

[Diagram: two nodes, each with cores, private L1 caches, a shared LLC, and memory]

SLIDE 17

Simple Snooping Protocol

- Relies on a write-through, write no-allocate cache
- Multiple readers are allowed
  - Writes invalidate replicas
- Employs a simple state machine for each cache unit

[Diagram: P1 and P2 with private caches on a memory bus; A holds 0 in memory]

SLIDE 18

Simple Snooping State Machine

- Every node updates its one-bit valid flag using a simple finite state machine (FSM)
- Processor actions: Load, Store, Evict
- Bus traffic: BusRd, BusWr

States: Valid, Invalid. Transitions are labeled action/traffic (local actions vs. snooped bus traffic):
- Invalid, Load/BusRd -> Valid
- Valid, Load/-- -> Valid
- Valid, Store/BusWr -> Valid
- Invalid, Store/BusWr -> Invalid (write no-allocate)
- Valid, Evict/-- -> Invalid
- Valid, BusWr/-- -> Invalid (another core's write invalidates the copy)
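The one-bit FSM above can be sketched as a next-state function. The type and function names here are invented for illustration; the transitions follow the slide's write-through, write no-allocate protocol, with `bus_rd`/`bus_wr` reporting the traffic this cache itself generates.

```c
/* one-bit VI coherence state per cache block */
typedef enum { INVALID, VALID } vi_state;
/* processor actions plus a snooped remote write */
typedef enum { LOAD, STORE, EVICT, SNOOP_BUSWR } vi_event;

vi_state vi_next(vi_state s, vi_event e, int *bus_rd, int *bus_wr) {
    *bus_rd = 0;
    *bus_wr = 0;
    switch (e) {
    case LOAD:
        if (s == INVALID)
            *bus_rd = 1;      /* Load/BusRd: fetch the block on a miss */
        return VALID;         /* Load/--: a hit stays Valid silently */
    case STORE:
        *bus_wr = 1;          /* Store/BusWr: write-through always hits the bus */
        return s;             /* write no-allocate: Invalid stays Invalid */
    case EVICT:
        return INVALID;       /* Evict/--: silent, the block is clean */
    case SNOOP_BUSWR:
        return INVALID;       /* BusWr/--: a remote write invalidates our copy */
    }
    return s;
}
```

Eviction generates no bus traffic because a write-through cache never holds dirty data, which is what keeps this protocol's hardware so simple.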