

SLIDE 1

Shared Memory Bus for Multiprocessor Systems

Mat Laibowitz and Albert Chiou Group 6

SLIDE 2

Shared Memory Architecture

  • We want multiple processors to share memory

Question: How do we connect them together?

[Diagram: several CPUs and a memory block, with the interconnect between them shown as question marks]

SLIDE 3

Shared Memory Architecture

[Diagram: one organization with all CPUs sharing a single, large memory vs. one with multiple smaller memories distributed among the CPUs]

Issues

  • Scalability
  • Access Time
  • Cost
  • Application: WLAN vs Single chip multiprocessor
SLIDE 4

Cache Coherency Problem

[Diagram: two CPUs, each with its own cache ($), holding copies of the same memory value]

  • Each cache needs to correctly handle memory accesses across multiple processors
  • A value written by one processor is eventually visible to the other processors
  • When multiple writes happen to the same location by multiple processors, all the processors see the writes in the same order (illustrated in the sketch below)
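To make the last requirement concrete, here is a minimal Python sketch (illustrative only; the project itself is a hardware design) that checks whether several per-CPU observation logs agree on a single write order. The log contents and helper name are made up for this example.

```python
# Minimal sketch of the write-serialization requirement: every processor must
# observe writes to a location in the same relative order. Values are assumed
# to be distinct.

observed = {
    "cpu0": [0, 1, 2],
    "cpu1": [0, 1, 2],
    "cpu2": [0, 2],      # skipping a value is fine as long as the order agrees
}

def same_write_order(logs):
    for a in logs.values():
        for b in logs.values():
            for i, x in enumerate(a):
                for y in a[i + 1:]:
                    # x came before y in log a; b must never show them reversed
                    if x in b and y in b and b.index(x) > b.index(y):
                        return False
    return True

print(same_write_order(observed))  # True: all CPUs agree on 0 -> 1 -> 2
```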

SLIDE 5

Snooping vs Directory

[Diagrams: CPU A, CPU B, and CPU C attached to Memory, comparing a snooping organization with a directory organization; cache states are annotated CPU A = M, CPU B = I, CPU C = I, with CPU A's modified (M) copy flagged (!!!) as the case that must be handled]

SLIDE 6

MSI State Machine

States: I (Invalid), S (Shared), M (Modified)

[State diagram edges, labeled input/output: CPURd/RingRd, CPUWr/RingInv, CPUWr/RingInv, CPURd/--, CPUWr/--, RingInv/--, CPURd/--, CPUWr/--, RingInv/-- DataMsg, RingInv/-- DataMsg]
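The slide lists the edge labels but not which arc each belongs to, so the Python sketch below assumes the standard MSI assignment of those events (CPURd, CPUWr, RingRd, RingInv, DataMsg). It is only an illustration; the actual design is the hardware controller described on the later slides.

```python
# A minimal MSI next-state function under the standard-protocol assumption.
from typing import Optional, Tuple

# (current state, event) -> (next state, message placed on the ring, if any)
MSI_NEXT = {
    ("I", "CPURd"):   ("S", "RingRd"),   # read miss: fetch a shared copy
    ("I", "CPUWr"):   ("M", "RingInv"),  # write miss: invalidate other copies
    ("S", "CPURd"):   ("S", None),       # read hit: no ring traffic
    ("S", "CPUWr"):   ("M", "RingInv"),  # upgrade: invalidate the other sharers
    ("S", "RingInv"): ("I", None),       # another cache is writing: drop our copy
    ("M", "CPURd"):   ("M", None),       # hits are served locally
    ("M", "CPUWr"):   ("M", None),
    ("M", "RingInv"): ("I", "DataMsg"),  # another writer: supply data, invalidate
    ("M", "RingRd"):  ("S", "DataMsg"),  # another reader: supply data, downgrade
}

def step(state: str, event: str) -> Tuple[str, Optional[str]]:
    # Events not listed (e.g. RingRd seen while in S) leave the state unchanged.
    return MSI_NEXT.get((state, event), (state, None))

print(step("I", "CPUWr"))   # ('M', 'RingInv')
print(step("M", "RingRd"))  # ('S', 'DataMsg')
```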

SLIDE 7

MSI Transition Chart

Chart columns: Cache State, Pending State, Incoming Ring Transaction, Incoming Processor Transaction, Actions.

Cache state M:
  • CPU Read(Hit): no action listed
  • CPU Write(Hit): no action listed
  • Ring Read(Hit): Add DATA; Cache->S & PASS
  • Ring Read(Miss): PASS
  • Ring Write(Hit): Add DATA; Cache->I & PASS
  • Ring Write(Miss): PASS
  • Ring WriteBack (Pending 1): PASS
  • Ring WriteBack (Pending 2): Pending->0 & Pass Token
  • Ring WriteBack: Pending->1

Cache state S:
  • CPU Read(Hit): no action listed
  • CPU Write: Pending->1; SEND Write
  • Ring Read(Hit): Add DATA; PASS
  • Ring Read(Miss): PASS
  • Ring Write(Hit): Add DATA; Cache->I & PASS
  • Ring Write(Miss): PASS
  • Ring Write (Pending 1): Modify Cache; Cache->M & Pass Token
  • Ring WriteBack (Pending 1): PASS
  • Ring WriteBack: Pending->0, Pass Token

Cache state I & Miss:
  • CPU Read: Pending->1; SEND Read
  • CPU Write: Pending->1; SEND Write
  • Ring Read: PASS
  • Ring Write: PASS
  • Ring WriteBack (Pending 1): PASS
  • Ring Read (Pending 1): DATA/S->Cache; SEND WriteBack(DATA)
  • Ring Write (I/S) (Pending 1): DATA/M->Cache, Modify Cache; SEND WriteBack(DATA)
  • Ring Write (M): DATA/M->Cache, Modify Cache; SEND WriteBack(DATA), SEND WriteBack(data), Pending->2
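As a rough illustration of the ring-side rows above, a cache snooping a message from the ring might react as sketched below. This is not the project's controller; the message and field names are assumptions loosely modeled on the trace shown on slide 13.

```python
# Sketch of ring-side snooping for a single cache line (illustrative only).
from dataclasses import dataclass

@dataclass
class RingMsg:
    op: str              # "Rd", "Wr", or "WrBk", as in the slide-13 trace
    addr: int
    data: int = 0
    valid: bool = False  # has some cache already attached the data?

def snoop(state: str, line_data: int, msg: RingMsg):
    """Return (new cache state, message to pass on) for a snooped message
    whose address hits this cache; misses are simply passed along."""
    if msg.op == "Rd" and state in ("M", "S"):
        # Chart: Read hit -> Add DATA; an M owner downgrades to S, a sharer stays S
        return "S", RingMsg(msg.op, msg.addr, line_data, True)
    if msg.op == "Wr" and state in ("M", "S"):
        # Chart: Write hit -> Add DATA; Cache->I & PASS
        return "I", RingMsg(msg.op, msg.addr, line_data, True)
    # WriteBacks from other caches, and anything that misses, just get passed.
    return state, msg

print(snoop("M", 0xAAAA, RingMsg("Rd", 0x250)))
# ('S', RingMsg(op='Rd', addr=592, data=43690, valid=True))
```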

SLIDE 8

Ring Topology

[Diagram: CPU 1 through CPU n, each with its own Cache and Cache Controller, plus a Memory Controller and Memory, all linked point-to-point in a ring]

SLIDE 9

Ring Implementation

  • A ring topology was chosen for its speed and electrical characteristics
    – Only point-to-point links
    – Acts like a bus
    – Scalable
  • Uses a token to ensure sequential consistency (see the sketch below)
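A minimal software sketch of the token idea (assumed structure, not the hardware): only the node currently holding the token may place a new request on the ring, so every cache observes requests, and in particular writes, in one global order.

```python
# Token-passing sketch: one request issued per token visit, then the token moves on.
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.pending = deque()           # CPU requests waiting for the token

    def on_token(self):
        # The token holder issues at most one request, then the token moves on.
        if self.pending:
            print(f"{self.name} issues {self.pending.popleft()} onto the ring")

nodes = [Node(f"cache{i}") for i in range(3)]
nodes[0].pending.append("Wr 0x230")
nodes[2].pending.append("Rd 0x250")

for node in nodes:                       # one lap of the token around the ring
    node.on_token()
# cache0 issues Wr 0x230 onto the ring
# cache2 issues Rd 0x250 onto the ring
```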

SLIDE 10

Test Rig

[Block diagram: mkMSICacheController, containing the waitReg, pending, and token state, the ringOut/ringIn FIFOs, the request/response FIFOs, and the rules that connect them to the mkMSICache (the $ controller)]

SLIDE 11

Test Rig

[Block diagram: mkMultiCacheTH test harness wrapping mkMultiCache (several $ controller instances, each a mkMSICacheController with waitReg, pending, token, ringOut/ringIn FIFOs, and request/response FIFOs around a mkMSICache) and mkDataMemoryController (toDMem/fromDMem FIFOs, dataReqQ/dataRespQ FIFOs, and mkDataMem), connected by rules and exposed as Client interfaces]

SLIDE 12

Test Rig (cont)

  • An additional module was implemented that takes a single stream of memory requests and deals them out to the individual CPU data request ports.
  • This module can either send one request at a time, waiting for a response before moving on to the next CPU, or deal them out as fast as the memory ports are ready.
  • This demux allows individual processor verification prior to multiprocessor verification.
  • It can then be fed set test routines that exercise all the transitions, or be hooked up to the random request generator (a sketch of the dealing logic follows).
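Below is a simplified Python sketch of that dealing behavior under an assumed port interface (send, ready, get_response); the real module is part of the hardware test harness, so this only illustrates the two modes described above.

```python
# Request demux sketch: deal one stream of requests to per-CPU ports, either
# one at a time (waiting for each response) or as fast as the ports are ready.

class DummyPort:
    """Stand-in for a CPU data request port, used only to exercise the sketch."""
    def __init__(self, name):
        self.name = name
    def ready(self):
        return True
    def send(self, req):
        print(f"{self.name} <- {req}")
    def get_response(self):
        return "ok"

def deal_requests(requests, ports, lockstep=True):
    i = 0
    for req in requests:
        if lockstep:
            ports[i].send(req)
            ports[i].get_response()          # verify one CPU at a time
        else:
            while not ports[i].ready():      # skip ports that cannot accept yet
                i = (i + 1) % len(ports)
            ports[i].send(req)
        i = (i + 1) % len(ports)

deal_requests(["Ld 0x2b8", "St 0x230", "Ld 0x250"],
              [DummyPort("cpu0"), DummyPort("cpu1")])
```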

SLIDE 13

Trace Example

=> Cache 2: toknMsg op->Tk8
=> Cache 5: toknMsg op->Tk2
=> Cache 3: ringMsg op->WrBk addr->0000022c data->aaaaaaaa valid->1 cache->1
=> Cache 3: getState I
=> Cache 1: newCpuReq St { addr=00000230, data=ba4f0452 }
=> Cache 1: getState I
=> Cycle = 56
=> Cache 2: toknMsg op->Tk7
=> Cache 6: ringMsg op->Rd addr->00000250 data->aaaaaaaa valid->1 cache->6
=> DataMem: ringMsg op->WrBk addr->00000374 data->aaaaaaaa valid->1 cache->5
=> Cache 6: getState I
=> Cache 8: ringReturn op->Wr addr->000003a8 data->aaaaaaaa valid->1 cache->7
=> Cache 8: getState I
=> Cache 8: writeLine state->M addr->000003a8 data->4ac6efe7
=> Cache 3: ringMsg op->WrBk addr->00000360 data->aaaaaaaa valid->1 cache->4
=> Cache 3: getState I
=> Cycle = 57
=> Cache 6: toknMsg op->Tk2
=> Cache 3: toknMsg op->Tk8
=> Cache 4: ringMsg op->WrBk addr->0000022c data->aaaaaaaa valid->1 cache->1
=> Cache 4: getState I
=> Cycle = 58
=> dMemReq: St { addr=00000374, data=aaaaaaaa }
=> Cache 3: toknMsg op->Tk7
=> Cache 7: ringReturn op->Rd addr->00000250 data->aaaaaaaa valid->1 cache->6
=> Cache 7: writeLine state->S addr->00000250 data->aaaaaaaa
=> Cache 7: getState I
=> Cache 1: ringMsg op->WrBk addr->00000374 data->aaaaaaaa valid->1 cache->5
=> Cache 1: getState I
=> Cache 4: ringMsg op->WrBk addr->00000360 data->aaaaaaaa valid->1 cache->4
=> Cache 4: getState I
=> Cache 9: ringMsg op->WrBk addr->000003a8 data->aaaaaaaa valid->1 cache->7
=> Cache 9: getState I
=> Cycle = 59
=> Cache 5: ringMsg op->WrBk addr->0000022c data->aaaaaaaa valid->1 cache->1
=> Cache 5: getState I
=> Cache 7: toknMsg op->Tk2
=> Cache 3: execCpuReq Ld { addr=000002b8, tag=00 }
=> Cache 3: getState I
=> Cache 4: toknMsg op->Tk8
=> Cycle = 60
=> DataMem: ringMsg op->WrBk addr->000003a8 data->aaaaaaaa valid->1 cache->7
=> Cache 2: ringMsg op->WrBk addr->00000374 data->aaaaaaaa valid->1 cache->5
=> Cache 2: getState I
=> Cache 8: ringMsg op->WrBk addr->00000250 data->aaaaaaaa valid->1 cache->6
=> Cache 8: getState I
=> Cache 5: ringReturn op->WrBk addr->00000360 data->aaaaaaaa valid->1 cache->4
=> Cache 5: getState S
=> Cycle = 61
=> Cache 5: toknMsg op->Tk8

SLIDE 14

Design Exploration

  • Scale up the number of cache controllers
  • Add additional tokens to the ring, allowing basic pipelining of memory requests
  • Tokens service disjoint memory addresses (e.g. odd or even), as sketched below
  • Compare average memory access time versus the number of tokens and the number of active CPUs
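A quick Python sketch of one way tokens could service disjoint address sets (an assumed scheme, not necessarily the exact one used in the project): even cache lines use token 0 and odd lines use token 1, so requests to different sets can overlap on the ring.

```python
# Map an address to the token that must be held before the request may be issued.
def token_for(addr: int, num_tokens: int, line_bytes: int = 4) -> int:
    return (addr // line_bytes) % num_tokens

print(token_for(0x230, 2))  # line 0x8C -> token 0
print(token_for(0x234, 2))  # next line -> token 1
```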

SLIDE 15

Test Results

[Chart: Number of Controllers vs. Avg. Access Time (2 Tokens); x-axis: Number of Controllers (3, 6, 9); y-axis: Average Access Time (clock cycles), 5-30]

SLIDE 16

Test Results

[Chart: Number of Tokens vs. Avg. Access Time (9 Controllers); x-axis: Number of Tokens (2, 4, 8); y-axis: Average Access Time (clock cycles), 5-30]

SLIDE 17

Placed and Routed

SLIDE 18

Stats (9 caches, 8 tokens)

  • Clock period: 3.71 ns (~270 MHz)
  • Area: 1,296,726 µm² (including memory)
  • Average memory access time: ~39 ns
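At the 3.71 ns clock period, ~39 ns works out to roughly 10-11 clock cycles per memory access, within the 5-30 cycle range plotted on the test-results slides.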