
Architecture

Calcolatori Elettronici e Sistemi Operativi

System architecture

CPU(s)
Memory hierarchy

– Caches – Main memory

Interconnect system

– System/Memory bus – I/O busses

ata, eisa, i2c, ide, pci, pcmcia, scsi, spi, usb, ...

System architecture

Peripherals (I/O devices)

– User I/O

Audio, Video, Keyboard, Mouse; other HID (human interface devices)

– Touchscreen, Joystick, ...

– Communication I/O

Serial port, Parallel port; Networking

– Bluetooth, Firewire, Infiniband, Ethernet, Wireless, Irda, NFC, ...

– Storage

Hard disks, CD-ROMs, Solid-state drives, MMC/SD cards, ...

System architecture

IRQ controllers
Timers
Other HW

– Clock management – Voltage management – GPIO – LEDs – Crypto devices – HW monitors

Temperature sensors, Voltage sensors, Fan speed sensors, ...


System architecture (example)

[Diagram: CPU (P) with its cache and Memory attached to the system bus; an HW I/O bridge connects the system bus to an I/O bus serving several I/O devices]

Architecture

Memory hierarchy

Memory Hierarchy

CPU / Memory speed mismatch

– Memory access time

1-2 cycles: high cost (area/energy/$); slower levels: 5-10 cycles, then 50-500 cycles

– Many accesses concentrate on small memory areas

Program characteristics:

– Predictability / Structure / Linear data structures / Sequential flow

Principle of locality / Locality of reference

– Temporal locality – Spatial locality

Memory Hierarchy

Temporal locality

– an accessed memory location is likely to be accessed again in the near future

Spatial locality

– after accessing memory location X, it is probable that a program will access locations X±1, X±2, ..., X±n (n small)


Memory Hierarchy

[Pyramid: big, cheap, slow levels at the bottom; small, expensive, fast levels at the top]

Memory Hierarchy

CPU registers → Cache (one or more levels, on- and off-chip) → RAM → Mass storage (HDD, Flash) → Backups (Tape)

Memory structure

[Diagram: memory array with a row decoder driven by the address, line precharge, sense amplifiers, and data-in/data-out paths gated by the write/read signals; a second version adds a column MUX on the data path]


SRAM cell

[Diagram: 6-transistor cell between VDD and GND; cross-coupled inverters (P1/N1, P2/N2) hold the bit, access transistors N3/N4 connect the cell to bitlines BL and BL_b under control of wordline WL]

  • 1. precharge the bitlines
  • read: to VDD/2
  • write: to VDD/2+δ and VDD/2−δ
  • 2. address the wordline (WL)
  • write: keep the bitlines driven

DRAM cell

[Diagram: 1T1C cell; an access transistor gated by wordline WL connects the storage capacitor to bitline BL]

– Small
– Destructive read: data must be restored after each read
– Needs refresh

DRAM access

  • the address is usually sent in two phases: ROW (latched by RAS) then COL (latched by CAS)

[Timing diagram: RAS and CAS strobes around the ROW and COL address phases and the DATA transfer; parameters shown: RAS time, CAS time, column address hold time, RAS access time]
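The two-phase multiplexed addressing can be sketched as a plain bit split; the 16-bit address and the 8-bit row / 8-bit column widths are made-up values for illustration:

```python
# Hypothetical 16-bit DRAM address split into 8-bit row and 8-bit column,
# mirroring the two-phase (ROW then COL) multiplexed addressing.
ROW_BITS, COL_BITS = 8, 8

def split_dram_address(addr):
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)  # sent first, latched by RAS
    col = addr & ((1 << COL_BITS) - 1)                # sent second, latched by CAS
    return row, col

print(split_dram_address(0x1234))  # (18, 52), i.e. row 0x12, column 0x34
```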

Memory Hierarchy

CPU → REGS → L1 → L2 → L3 → RAM → HDD → TAPE


Cache

[Diagram: a cache of N entries (0 ... N−1) holding (ADDRESS, DATA) pairs, looked up with ADDRESS = i]

Cache-Miss

– the requested address i is not present: the data (DATA = 123) must be fetched from the next level and inserted, paying the Miss Penalty (time/energy)

Cache-Hit

– the requested address i is present: DATA = 123 is returned directly

⟨Access⟩ = Access_cache · (1 + MR·MP)   (time/energy; MR: miss rate, MP: miss penalty in units of one cache access)
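The average-access formula can be evaluated directly; a minimal sketch, with made-up parameter values:

```python
def avg_access(t_cache, miss_rate, miss_penalty):
    """Average access cost per the slide's formula:
    <Access> = Access_cache * (1 + MR * MP),
    with MP expressed in units of one cache access."""
    return t_cache * (1 + miss_rate * miss_penalty)

# e.g. 1-cycle cache, 5% miss rate, miss penalty worth 100 cache accesses
print(avg_access(1.0, 0.05, 100))  # 6.0
```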

Cache

Read

– Hit
– Miss: read data from the next levels
  – a whole line is read (exploits spatial locality)

Write

– Hit
  Write-through / Write-back
– Miss
  Write-allocate / Write-no-allocate
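The write-hit policies trade memory traffic differently; a toy model (not a full cache simulator) counting memory writes for n write hits to one cached line:

```python
# Toy comparison of write-hit policies: n writes to the same cached line,
# followed by one eviction of the line.
def memory_writes(policy, n_writes):
    if policy == "write-through":
        return n_writes            # every write hit also goes to memory
    if policy == "write-back":
        return 1                   # dirty line written back once, on eviction
    raise ValueError(policy)

print(memory_writes("write-through", 100), memory_writes("write-back", 100))  # 100 1
```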

slide-6
SLIDE 6

Cache

Associative

– addressed by content

implemented with CAM (content-addressable memories), or with standard memories plus control logic

Cache

Associative

[Diagram: each entry stores V (valid bit), ADDRESS, and DATA; the input ADDRESS is compared (COMP) against every stored address to produce HIT/MISS and select the LINE]

Cache

Direct-Mapped

[Diagram: the ADDRESS is split into TAG | IDX | DSP; IDX selects one line (V, TAG, DATA), the stored TAG is compared (COMP) with the address TAG to produce HIT/MISS, and a MUX driven by DSP selects the word within the LINE]

Lsize = 2^#DSP (line or block size)
Lines = 2^#IDX (number of lines or blocks)

more data words in a cache line

Cache

Direct-Mapped

[Diagram as on the previous slide: ADDRESS split into TAG | IDX | DSP, with Lsize = 2^#DSP and Lines = 2^#IDX]

Cache Size = Lines · Lsize (DATA size)
Actual size = Cache Size + (TAG + V) · Lines
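The TAG | IDX | DSP split can be sketched with plain bit operations; the 10-bit index and 6-bit displacement widths are arbitrary example values:

```python
# Sketch: splitting an address into TAG | IDX | DSP for a direct-mapped
# cache with Lines = 2**idx_bits and Lsize = 2**dsp_bits.
def split_address(addr, idx_bits, dsp_bits):
    dsp = addr & ((1 << dsp_bits) - 1)               # byte within the line
    idx = (addr >> dsp_bits) & ((1 << idx_bits) - 1) # which line
    tag = addr >> (dsp_bits + idx_bits)              # compared for hit/miss
    return tag, idx, dsp

# 64-byte lines (dsp_bits=6), 1024 lines (idx_bits=10) -> 64 KiB of data
tag, idx, dsp = split_address(0x12345678, idx_bits=10, dsp_bits=6)
print(hex(tag), idx, dsp)  # 0x1234 345 56
```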


Cache

Direct-Mapped

[Diagram: memory addresses with the same IDX field, e.g. TAGs 0000, 0001, 0010, 0011 all with IDX 0001, mapping to the same DM-cache line]

More memory addresses are (statically) mapped to the same cache line

Cache

Set-Associative

[Diagram: ADDRESS split into TAG | IDX | DSP; IDX selects one set across n ways, each way holding (V, TAG, DATA); per-way comparators produce H1, H2, ..., Hn, a MUX selects the hitting way's LINE, and a second MUX driven by DSP selects the word]

HIT = H1 + H2 + ... + Hn

Cache

Set-Associative

#TAG = #ADDRESS − #IDX − #DSP
Lines = 2^#IDX
Lsize = 2^#DSP
Cache Size = nWays · Lines · Lsize (DATA size)
Actual size = Cache Size + [ (TAG + V) · Lines · nWays ]

nWays = 1: Direct-Mapped
Lines = 1: fully Associative
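The sizing formulas can be checked with a small sketch (the field widths and way count are arbitrary example values):

```python
# Sketch of the slide's sizing formulas for a set-associative cache.
def cache_sizes(addr_bits, idx_bits, dsp_bits, n_ways):
    lines = 2 ** idx_bits          # Lines = 2^#IDX (sets)
    lsize = 2 ** dsp_bits          # Lsize = 2^#DSP (bytes per line)
    tag_bits = addr_bits - idx_bits - dsp_bits
    data_size = n_ways * lines * lsize              # Cache Size, in bytes
    overhead_bits = (tag_bits + 1) * lines * n_ways # (TAG + V) per line, in bits
    return data_size, overhead_bits

data, overhead = cache_sizes(addr_bits=32, idx_bits=7, dsp_bits=5, n_ways=4)
print(data, overhead)  # 16384 bytes of data, 10752 bits of TAG+V overhead
```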

Cache

Replacement

– LRU
  counters or shift registers (nWays × Lines), or pseudo-LRU
– FIFO
– Random


Cache

Replacement

– LRU with counters or shift registers (nWays × Lines)

Counters: reset the last accessed way's counter and increment the counters below the modified one; the way with the highest count is the victim.

[Worked example omitted: counter values for ways 0-3 of line i across an access sequence]

LRU stack (shift registers): insert the last accessed way at the top, shifting the other values down; the bottom register holds the victim.

[Worked example omitted: LRU-stack contents (reg-0 ... reg-3) for line i across an access sequence]
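The LRU-stack organisation can be sketched with a Python list standing in for the shift registers (the initial ordering and access sequence are illustrative):

```python
# Sketch of the shift-register ("LRU stack") organisation: the most
# recently used way sits at the top; the victim is the bottom entry.
def lru_access(stack, way):
    if way in stack:
        stack.remove(way)
    stack.insert(0, way)           # insert last accessed way, shift others down
    return stack

stack = [3, 2, 1, 0]               # front = MRU, back = LRU victim
for way in [1, 3, 3, 2]:
    lru_access(stack, way)
print(stack, "victim:", stack[-1])  # [2, 3, 1, 0] victim: 0
```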

Cache

Replacement

– LRU: pseudo-LRU

4 ways, 3 bits (B0, B1, B2) for each line:

(B0,B1,B2) = 00x → replace way 0, then B0 = not B0, B1 = not B1 (state becomes 11x)
(B0,B1,B2) = 01x → replace way 1, then B0 = not B0, B1 = not B1 (state becomes 10x)
(B0,B1,B2) = 1x0 → replace way 2, then B0 = not B0, B2 = not B2 (state becomes 0x1)
(B0,B1,B2) = 1x1 → replace way 3, then B0 = not B0, B2 = not B2 (state becomes 0x0)

[Diagram: decision tree with B0 at the root, B1 selecting between ways 0/1, B2 between ways 2/3]
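A sketch of the 4-way tree pseudo-LRU; the `touch()` update, which points the bits away from any accessed way, is an assumed generalisation of the slide's flip-on-replacement rule:

```python
# Hedged sketch of 4-way tree pseudo-LRU with bits (B0, B1, B2):
# B0 selects the half, B1/B2 select the way within each half.
def victim(b):
    b0, b1, b2 = b
    if b0 == 0:
        return 0 if b1 == 0 else 1     # 00x -> way 0, 01x -> way 1
    return 2 if b2 == 0 else 3         # 1x0 -> way 2, 1x1 -> way 3

def touch(b, way):
    """Point the bits *away* from the accessed way (assumed update rule)."""
    b0, b1, b2 = b
    if way in (0, 1):
        b0, b1 = 1, (1 if way == 0 else 0)
    else:
        b0, b2 = 0, (1 if way == 2 else 0)
    return [b0, b1, b2]

bits = [0, 0, 0]
w = victim(bits)          # way 0 is the victim
bits = touch(bits, w)     # -> [1, 1, 0]; next victim is way 2
print(w, victim(bits))    # 0 2
```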

Cache

Misses (3-C's model)

– Compulsory

cold-start miss

– Capacity

miss that would occur even in a fully associative cache of the same size

– Conflict (Collision)

miss that would not occur in a fully associative cache

– fully associative caches do not have conflict misses

too many conflicts: thrashing

– conflict misses can avoid capacity misses

Cache

conflict misses can avoid capacity misses

  • repeated, sequential accesses from 0 to B (B+1 bytes)
  • 1. Associative cache, size B, LS=4
  • access to 0: miss (compulsory) insert the whole line (addresses 0,1,2,3)
  • access to 1,2,3: hit
  • access to 4: miss (compulsory) insert the whole line (addresses 4,5,6,7)
  • ...
  • access to B: miss (capacity) replace reference to addresses 0,1,2,3
  • access to 0: miss (capacity) replace reference to addresses 4,5,6,7
  • access to 1,2,3: hit
  • access to 4: miss (capacity)
  • ...

MR ~ 0.25 ( MR = (B/4 + 1)/(B+1) )


Cache

conflict misses can avoid capacity misses

  • repeated, sequential accesses from 0 to B (B+1 bytes)
  • 1. Associative cache, size B, LS=4

MR = 0.25

  • 2. DM cache, size B, LS=4
  • access to 0: miss (compulsory), insert the whole line (addresses 0,1,2,3)
  • access to 1,2,3: hit
  • access to 4: miss (compulsory), insert the whole line (addresses 4,5,6,7)
  • ...
  • access to B: miss (conflict), replaces the line for addresses 0,1,2,3 (same index)
  • access to 0: miss (conflict), replaces the line for addresses B,...,B+3
  • access to 1,2,3: hit
  • access to 4: hit
  • ...

After N repetitions: MR = [0.25 + N · 2/(B+1)] / (N+1) → 2/(B+1) for large N
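The two miss rates can be reproduced with a toy simulator of this example (B = 64 bytes and 50 passes are arbitrary choices):

```python
# Toy simulation of the slide's example: repeated sequential byte accesses
# 0..B over a cache of size B with 4-byte lines, comparing a fully
# associative LRU cache with a direct-mapped one.
def misses(assoc, B=64, passes=50, lsize=4):
    nlines = B // lsize
    miss = 0
    if assoc:                           # fully associative, true LRU
        stack = []                      # MRU at the front
        for _ in range(passes):
            for addr in range(B + 1):
                line = addr // lsize
                if line in stack:
                    stack.remove(line)  # hit: refresh its position
                else:
                    miss += 1
                    if len(stack) == nlines:
                        stack.pop()     # evict the LRU line
                stack.insert(0, line)
    else:                               # direct-mapped
        tags = [None] * nlines
        for _ in range(passes):
            for addr in range(B + 1):
                line = addr // lsize
                idx = line % nlines
                if tags[idx] != line:
                    miss += 1
                    tags[idx] = line
    return miss / (passes * (B + 1))

# associative ~ (B/4+1)/(B+1) = 0.262; direct-mapped settles near 2/(B+1)
print(round(misses(True), 3), round(misses(False), 3))  # 0.262 0.035
```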

Cache

Rules of thumb

– MR(DM, size N) ≈ MR(2-way, size N/2)
– 4× size → ½ miss rate
– Enlarging Lsize: decreases MR, increases MP

[Plot: MR% (5-10) vs line size LS (16-256) for cache sizes 4K-256K, SPEC92 benchmarks]

Cache

Stack Distance

– program memory references: addr1, addr2, addr3, ..., addrn
– push references on a stack (removing a reference from the stack if already present)
– stack distance of reference R:
  its position in the stack (if present)
  ∞ (if not present)

Cache

Stack Distance

– program memory references: 100, 104, 108, 100, 108 → SD(108) = 1

[Diagram: stack contents (top to bottom) 108, 100, 104, with stack positions labeled]
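The stack-distance procedure can be sketched directly; distances here use the convention that the stack top is depth 0, which reproduces SD(108) = 1 for the example trace:

```python
import math

# Stack-distance computation as described: push references on a stack,
# removing an already-present reference first; the distance is the depth
# (top = 0) at which the reference was found, or infinity.
def stack_distances(trace):
    stack, dists = [], []
    for ref in trace:
        if ref in stack:
            dists.append(stack.index(ref))   # depth below the top
            stack.remove(ref)
        else:
            dists.append(math.inf)           # first reference: infinite distance
        stack.insert(0, ref)
    return dists

print(stack_distances([100, 104, 108, 100, 108]))  # [inf, inf, inf, 2, 1]
```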


Cache

Stack Distance

– P_HIT of a reference with stack distance D (L: Lines, W: nWays)

Hyp: uniform distribution of cache line accesses. A reference with stack distance D hits if fewer than W of the D intervening lines fall in its set:

P_HIT(D) = Σ_{a=0}^{W−1} C(D,a) · (1/L)^a · ((L−1)/L)^(D−a)

DM cache (W = 1):
– D = 0: P_HIT = 1 (two consecutive refs to the same line)
– D = 1 (access sequence: addr, other, addr): miss if "other" replaced addr → P_MISS = 1/L, P_HIT = 1 − 1/L = (L−1)/L
– general D: probability that other1, other2, ..., otherD did not evict addr → P_HIT = P_HIT(D=1)^D = [(L−1)/L]^D
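The set-associative formula can be checked numerically against the direct-mapped special case (the L, D, W values below are arbitrary):

```python
from math import comb

# Hit probability for a reference with stack distance D, assuming the D
# intervening lines map uniformly and independently over the L sets.
def p_hit(D, L, W):
    p = 1 / L
    return sum(comb(D, a) * p**a * (1 - p)**(D - a) for a in range(W))

# W = 1 reduces to the slide's direct-mapped result [(L-1)/L]^D
L, D = 64, 10
print(abs(p_hit(D, L, 1) - ((L - 1) / L) ** D) < 1e-12)  # True
```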

Cache

Multi-level cache

– Inclusive: data in L1 are also in L2, in L3, ...
– Exclusive: data are in L1, or in L2, or ... (only one level)
– Mainly-inclusive (intermediate)
– Victim cache, L0 cache, Loop cache, ...


Architecture

Cache coherence

Memory hierarchy Cache coherence

Several caches

– parallel architectures

The same data may be stored in more than one cache. On a write to one cache: what to do with the other copies?

Cache coherence

[Diagram: processors P with private caches ($) and memories on an interconnect; the same datum has private copies in several caches]

enforce data coherence on private changes

Cache coherence

Software based

– compiler or OS support
– with or without HW assistance
– a tough problem

needs perfect information about sharing

Hardware based

slide-12
SLIDE 12

Cache coherence

Hardware based

– snoop based / directory based

snooping requires a shared medium

– write-through / write-back
– invalidate / update

single producer, many consumers:

– poor invalidate performance (consumers must re-read after every invalidation)

corei performs multiple writes before corej reads the data:

– update forces useless writes

junk data accumulated in caches (process migration, task switching):

– update forces useless writes

– dirty sharing / no dirty sharing

Cache coherence

Shared bus

– snooping protocols

Other interconnects

– directory-based coherence

Snoop-based cache coherence

[Diagram: four processors P, each with a cache ($) and a snoop device, sharing a bus with memory]

snoop device

– inspects the traffic on the bus and enforces coherence

Snoop-based cache coherence

Cache policy

– write-through

CC:

– invalidate / update

– write-back

CC:

– invalidate: MSI, MESI (Illinois), MOSI, MOESI, ...
– update: Firefly, Dragon, ...


Snoop-based cache coherence WT-protocols

WTI (write-through invalidate)

– the snoop device detects a write and resets the valid bit
– simple
– but data must be re-read after invalidation

WTU (write-through update)

– the snoop device detects a write, reads the data and updates the cache

Snoop-based cache coherence MESI protocol

Cache line status

– Modified

data is valid, present only in this cache, modified with respect to system memory

– Exclusive

data is valid, present only in this cache, consistent with system memory

– Shared

data is valid, present in some other cache too, consistent with system memory

– Invalid

data is not valid

Snoop-based cache coherence MESI protocol

Events

– Read-miss

shared (RMS): other snoop devices signal that their caches own the data

exclusive (RME): no other copies exist in other caches

– Read-hit (RH)
– Write-hit (WH)
– Write-miss (WM)
– Snoop-hit (event caught by the snoop device): on a read (SHR) or on an invalidate (SHI)
– Cache line replacement (LRU)

Snoop-based cache coherence MESI protocol

Actions

– Read

read the cache line from memory

– Invalidate

broadcast an invalidate (all other caches must invalidate their copies of the data)

– Write

write the cache line back to memory


Snoop-based cache coherence MESI protocol

[State diagram over Invalid, Shared, Exclusive, Modified; edges labeled with the events below and annotated with the bus actions read, write, and invalidate]

RH: read hit; WH: write hit; SHR: snoop hit on read; RMS: read miss, shared; WM: write miss; SHI: snoop hit on invalidate; RME: read miss, exclusive; LRU: lru replacement
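The transitions can be sketched as a lookup table; since the diagram text is garbled here, this follows the textbook MESI protocol and should be taken as an assumption-level reconstruction, using the slide's event names:

```python
# Sketch of a standard MESI transition table for one cache line, keyed by
# (state, event); snoop events (SHR, SHI) come from other caches' traffic.
MESI = {
    ("I", "RMS"): "S", ("I", "RME"): "E", ("I", "WM"): "M",
    ("E", "RH"): "E", ("E", "WH"): "M", ("E", "SHR"): "S", ("E", "SHI"): "I",
    ("S", "RH"): "S", ("S", "WH"): "M",   # write hit broadcasts an invalidate
    ("S", "SHR"): "S", ("S", "SHI"): "I",
    ("M", "RH"): "M", ("M", "WH"): "M",
    ("M", "SHR"): "S",                    # write the line back, share it
    ("M", "SHI"): "I", ("M", "LRU"): "I", # write the line back, drop it
}

state = "I"
for ev in ["RME", "WH", "SHR", "SHI"]:    # exclusive read, write, then snoops
    state = MESI[(state, ev)]
print(state)  # I -> E -> M -> S -> I
```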

Snoop-based cache coherence MOESI protocol

[State diagram over I, S, O (Owned), E, M; edges labeled as in the MESI diagram]

MESI + Owned state. Owned: data shared among several caches, memory not updated (the owner must provide the data to the others). SHR: provide the data on the bus.

Snoop-based cache coherence Firefly protocol

Valid-exclusive

– only one copy, memory is consistent

Shared

– several copies, memory is consistent

Dirty

– only one copy, memory is not consistent

Snoop-based cache coherence Firefly protocol

Read miss

– shared copies exist → mark as shared

  • no dirty copy: data comes from the other caches
  • one dirty copy: data comes from the dirty cache, memory is updated, and the dirty cache becomes shared

– no shared copies → mark as valid-exclusive

Write hit

– dirty or valid-exclusive: write locally, no status change
– shared: all other copies are updated

Write miss

– data from memory: write locally, status becomes dirty
– data from other caches: update all copies (and memory), status becomes shared

Replacement

– dirty: write back to memory


Snoop-based cache coherence Firefly protocol

[State diagram over Invalid, Shared, Dirty, Valid-exclusive; edges labeled with the events below]

write policy: write-back for private data, write-through for shared data

RH: read hit; WH: write hit; SHR: snoop hit on read; RMS: read miss, shared; WMm: write miss (data from memory); SHW: snoop hit on write; RME: read miss, exclusive; WMc: write miss (data from other cache); LRU: lru replacement

Directory-based cache coherence

The interconnect is message-oriented. For each data block keep track of:

– shared / non-shared

send broadcast messages whenever needed

– owners

send messages only to the other owners

The tracking information can be:

– centralized

bottleneck

– distributed

replication

Software-based coherence

Do not cache shared r/w variables

[Diagram: processors P1-P4 with caches $; each core's private memory (mem1 ... mem4) is a cacheable area (only for that core), while the shared area memsh is not cacheable (shared data)]


Software-based coherence

Do not cache shared r/w variables, or switch data between cacheable and non-cacheable dynamically


Architecture

Input / Output

Input / Output

Special instructions
Memory mapped
Polling
Interrupt
DMA

Input / Output

[Diagram: CPU (P) with cache, Memory, and HW I/O on a bus with Address, Data, and MEM/IO select lines]

Special instructions: in, out — e.g.: in R1, 0x0010

Devices are used by reading and writing their internal registers. Do not cache data read from HW.

Special instructions: in, out — e.g.: in R1, 0x0010

Memory mapped — e.g.: address 0xFE000000 is R0 of HW, address 0xFE000004 is R1 of HW

[Diagram: HW device on the memory bus with Address, Data, Enable, CLK; CPU (P) with cache, Memory, and HW I/O sharing the memory Address/Data lines]

Devices are used by reading and writing their internal registers. Do not cache data read from HW.


Memory-mapped HW registers

[Diagram, variant 1: registers R0-R3 loaded from data_in through a decoder (DEC: load_R0 ... load_R3) driven by the 2 low address bits; a MUX drives data_out; an equality comparator matches the top m−2 address bits against 0x3F800000 (0xFE000000 >> 2) to generate enable; write gates the loads]

HW device with word-granular addresses: R0 is mapped to 0xFE000000, R1 to 0xFE000001, R2 to 0xFE000002, R3 to 0xFE000003

[Diagram, variant 2: as above, but the comparator matches the top m−4 address bits against 0x0FE00000 (0xFE000000 >> 4), and address bits 3:2 select the register]

HW device with byte addresses and word-aligned registers: R0 is mapped to 0xFE000000, R1 to 0xFE000004, R2 to 0xFE000008, R3 to 0xFE00000C
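The second decoder can be sketched as a plain address check; BASE and the 16-byte window mirror the example mapping:

```python
# Sketch of the word-aligned decoder: the top m-4 address bits are compared
# with 0x0FE00000 (0xFE000000 >> 4) to enable the device, and address
# bits 3:2 select one of the four registers R0..R3.
BASE = 0xFE000000

def decode(addr):
    enable = (addr >> 4) == (BASE >> 4)          # compare top m-4 bits
    reg = (addr >> 2) & 0x3 if enable else None  # bits 3:2 pick R0..R3
    return enable, reg

print(decode(0xFE000004))  # (True, 1): selects R1
print(decode(0xFE000010))  # (False, None): outside the 16-byte window
```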

Input / Output

Polling

– Check the device status flags, e.g.:

– DATAREADY, DEVREADY, ...
– H0 is mapped at address 0xFFFF0000
– H0(1:0) = (DATAREADY,DEVREADY) ; hw register H1: data

      MOV R0, 0xFFFF0000
  L1: LDR R1, R0       ; read status (hw register H0)
      TST R1, 2        ; data is ready?
      BEQ L1           ; no: read again
      MOV R0, 0xFFFF0004
      LDR R1, R0       ; read data (hw register H1)

– very simple
– CPU time wasted

[Diagram: device registers H0 (DATA READY, DEV READY bits) and H1 (n-bit data)]

Input / Output

Interrupt

– Program the device for the data transfer
– Execute something else
– Get the data when the device sends a signal (interrupt)
– Interrupts have a priority

fast devices: high-priority interrupts; slow devices: low-priority interrupts


Input / Output

Interrupt

– usually CPUs have a few IRQ pins

CPU #irqs << #devices

– Shared IRQ lines

daisy chain

– IRQ controllers

Input / Output

Daisy-chained interrupts

– the device signals an interrupt
– the device uses internal registers to show that it is waiting to be served
– the CPU reads HW registers to find the devices to handle

[Diagram: CPU with one interrupt line shared by HW1, HW2, HW3; the ack signal is daisy-chained from device to device]

Input / Output

Interrupt controller(s)

– the device signals an interrupt
– the device uses internal registers to show that it is waiting to be served
– the CPU reads the registers of the IRQ controllers to find which device to handle

[Diagram: CPU fed by cascaded IRQ controllers, each with its own interrupt lines to HW1-HW3 and HW4-HW6]

Input / Output

Maskable Interrupt (IRQ)

– the CPU can ignore interrupts

instructions to mask / unmask interrupts

Non-Maskable Interrupt (NMI)

– always received
– critical events: parity errors, power off


Input / Output

Interrupt: Level-triggered

– the interrupt line is kept high until the interrupt is handled
– if the line is shared, all interrupts must be served:

scan devices until a requesting one is found, handle the interrupt, then check the interrupt line again

Interrupt: Edge-triggered

– the interrupt is signaled by a pulse
– if the line is shared: check all devices (multiple pulses can be merged)
– if masked, an interrupt can be lost

use a latch to record pulses

Interrupt: Message-signaled

Input / Output

Interrupt handling

  • 1. Finish the current instruction
  • 2. Save flags (not always) and return address
  • 3. Signal interrupt handling
  • 4. Find the handling routine

The routine can depend on the interrupt line

  • 5. Jump to the routine

A typical handling routine:

  • 1. mask interrupts
  • 2. access the device
  • 3. unmask interrupts
  • 4. handle the data transfer

The CPU is unresponsive while interrupts are masked

Input / Output

Precise Interrupt

– PC saved in a known position
– all instructions up to the current one: executed
– current instruction: known state
– all instructions after the current one: not executed (their results are discarded)

Imprecise interrupt

Input / Output

Example: Interrupts in the PC/AT

[Diagram: CPU connected to two cascaded 8259A controllers through INTR, INTA and DATA; each 8259A contains IMR, IRR, ISR]

  • 1. 8259A: INTR=1
  • 2. CPU: INTA=0 (first pulse)
  • 3. CPU: INTA=0 (second pulse)
  • 4. 8259A: puts the interrupt vector on the data bus (8 bits)
  • 5. CPU: jumps to the service routine (depends on the data received)

IMR: Interrupt Mask Register; IRR: Interrupt Request Register; ISR: Interrupt Service Register. 8259A: Intel programmable interrupt controller


Input / Output

Example: Interrupt assignments in PC/AT

  • Master 8259

IRQ0: System timer

IRQ1: Keyboard controller

IRQ2: to slave 8259

IRQ3: 2 serial ports (COM2 and COM4)

IRQ4: 2 serial ports (COM1 and COM3)

IRQ5: parallel port LPT2

IRQ6: floppy disk controller

IRQ7: parallel port LPT1

  • Slave 8259

IRQ8: real-time clock (RTC)

IRQ12: mouse controller

IRQ13: math coprocessor

IRQ14: hd controller 1

IRQ15: hd controller 2

Input / Output

DMA

– the device can access the memory directly
– after a whole data transfer is terminated, an interrupt is sent

[Diagram: CPU (P) with cache, Memory, and HW I/O sharing the bus]

– the processor writes the device registers to set up the transfer: memory pointer, data size, transfer type
– the device reads/writes data in memory at its own rate and latency
– the device sends an interrupt
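The DMA sequence can be sketched as a toy device model (the class and register names are invented for illustration):

```python
# Toy model of a DMA transfer: the CPU only programs the device
# registers; the device moves the whole block, then raises an interrupt.
class DmaDevice:
    def __init__(self, memory):
        self.memory = memory          # shared "system memory"
        self.irq = False

    def setup(self, pointer, size, data):
        # CPU writes the device registers: memory pointer, data size
        self.pointer, self.size, self.data = pointer, size, data

    def run(self):                    # device works at its own rate
        self.memory[self.pointer:self.pointer + self.size] = self.data[:self.size]
        self.irq = True               # interrupt: whole transfer done

mem = bytearray(16)
dev = DmaDevice(mem)
dev.setup(pointer=4, size=4, data=b"\x01\x02\x03\x04")
dev.run()
print(dev.irq, mem[4:8])  # True bytearray(b'\x01\x02\x03\x04')
```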

Input / Output

Polling

– simple
– computationally expensive

Interrupt

– the CPU transfers data from the device to memory
– one interrupt for each data word

DMA

– one interrupt for each data block