Caches
Electronic Computers M
LOCALITY PRINCIPLE (SPATIAL AND TEMPORAL), WORKING SET
Memory hierarchy: CPU registers, Cache I lev., Cache II lev., Cache III lev., Memory, Disk, Tape
The cache is a memory with an access time some orders of magnitude shorter than that of the main memory BUT with a much smaller size. It contains a small (see later) replicated portion of the main memory.
The processor first looks for the data in the cache (hit) and then, when the data is not found, in the main memory (miss).
The cache stores groups of bytes at contiguous addresses (normally 32, 64, 128 or more, and in any case "aligned", that is starting at an address which is a multiple of the group size): each group is called a «line».
Memory access time: >100 clock cycles. Cache access time: 1 to 4 clock cycles.
[Figure: memory lines (2, 5, m, m+1, n, n+2), 32-256 bytes per line, and the cache slots holding copies of some of them, each slot with its line number and its data line.]
Number of line: the line number (tag) is the complete address of the first byte of the line minus the LSBits (the bits which define the line size in bytes), which are zeros (alignment!).
Processor generated address = line number + in-line offset. The line number is used to detect the position in the cache; the offset selects the data within the cache line.
Accessed data range: from a single byte to the entire line.
Let's consider a cache line of 32 bytes. The in-line offset (0, 1, 2, …, 31) is given by the 5 LSBits of the data/instruction address.
[Figure: the cache line consists of bytes 0, 1, 2, …, 31.]
The size of the readable data depends on the processor parallelism. The cache read/write data MUST be aligned (address multiple of the read data size).
Unaligned accesses are allowed in many CISC computers (this implies that two consecutive accesses, and therefore two cache accesses, are sometimes mandatory). Why?
Because this implies that the most significant part of the address must be incremented.
Please notice that the cache offset has nothing to do with the page offset.
Each cache slot stores, together with the data, the most significant bits of the lowest byte address of the line (line number, TAG).
[Figure: cache slots, each holding a line number and its data.]
The requested line is searched by means of a parallel comparison between all cache line numbers (TAGs) and the MSBs of the CPU address. The comparison can be either successful (hit) or not (miss).
Fully associative cache
[Figure: cache slots, each with a TAG, a validity bit and a line; any memory line (Line 0, Line 1, …, Line z) can be stored in any slot.]
The TAG is the line number.
Example: 36 bit address, 256 byte lines (256 bytes/line). Offset in line: 8 bit. Tag = 36 - 8 = 28 bit.
The cache size is always a power of 2, as is the line size.
Processor generated address = line number (28 bit) + in-line offset.
The line number is compared with all cache TAGs. In case of HIT (and if the validity bit is 1) the requested data is present. The address offset is the position of the first byte in the line (the requested data can be a byte, a word, a double word and so on, provided it is within the line boundary).
This cache organization makes the best use of the cache but it is terribly complex since it requires many comparators (if the cache has 1024 slots, in which case the cache size is 256 Kbytes, 1024 comparators of 28 bit are required!) and normally caches have 64K slots and more.
Each cache line has status bits (2 or more), so each slot stores Tag, Data and Status. In this case the cache memory is (in bits) 1024 x (28 + 256 x 8 + 2) bits = 2,127,872 bits.
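The arithmetic of the example above can be checked with a short sketch. The function name and its parameters are illustrative assumptions, not part of the slides; the numbers are those of the example (1024 slots, 36 bit address, 256 byte lines, 2 status bits).

```python
def fully_associative_bits(slots, address_bits, line_bytes, status_bits):
    """Return (tag bits, total storage bits) of a fully associative cache.

    Hypothetical helper: assumes line_bytes is a power of 2.
    """
    offset_bits = (line_bytes - 1).bit_length()   # log2(line size)
    tag_bits = address_bits - offset_bits         # the tag is the whole line number
    bits_per_slot = tag_bits + line_bytes * 8 + status_bits
    return tag_bits, slots * bits_per_slot


tag_bits, total_bits = fully_associative_bits(1024, 36, 256, 2)
print(tag_bits, total_bits)   # 28 tag bits, 2127872 bits in total
```

One comparator of `tag_bits` width is needed per slot, which is where the cost of this organization comes from.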
[Figure: cache slots, each with a TAG, a validity bit and a line, and the memory lines (Line 0, Line 1, …, Line z) they can hold.]
In each cache slot only a subset of all memory lines can be stored. For instance, slot 0 can store only those lines whose line number divided by the number of slots has remainder 0, slot 1 those with remainder 1, and so on. Obviously the initial memory address of the data in each slot is the line number joined with zeroes (LSBits).
For instance: 1 MB main memory, 64 byte lines => 16K different lines. If the cache has 128 slots (the cache size is therefore 128 x 64 bytes = 8 KBytes), slot 0 holds lines number 0, 128, 256, etc., slot 1 lines number 1, 129, 257, etc.
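The slot assignment of this example can be sketched as follows. The constant and function names are assumptions made for illustration; the sizes are the ones quoted in the example.

```python
LINE_BYTES = 64          # line size of the example
SLOTS = 128              # number of cache slots of the example
MEMORY_BYTES = 1 << 20   # 1 MB main memory


def slot_of(address):
    """Slot that may hold the line containing this byte address."""
    line_number = address // LINE_BYTES
    return line_number % SLOTS


total_lines = MEMORY_BYTES // LINE_BYTES
lines_in_slot_1 = [n for n in range(total_lines) if n % SLOTS == 1][:3]
print(total_lines, lines_in_slot_1)   # 16384 lines; slot 1 holds 1, 129, 257, ...
```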
[Figure: memory lines 0 to 15 and the cache slots (TAG, validity bit, line) onto which they are mapped.]
The LSBs of the line number indicate the only cache slot where the line can be stored.
Consider a processor with 36 bit address (64 GB) and 256 byte lines (8 bit offset): the line number is 28 bit (how many lines? 2^28 = 2^10 x 2^10 x 2^8).
If the cache has 1024 slots (256 KB) the 10 LSBs (2^10 = 1024) of the line number (index) indicate the slot where a line must be stored.
Only one 10 bit decoder (to detect the involved slot) and only one 18 bit comparator are needed.
Very little flexibility.
Processor generated address = line number (28 bit) + in-line offset (8 bit); line number = TAG (18 bit) + index, that is the slot (10 bit).
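A minimal sketch of the address decomposition just described, with the field widths of the example as defaults (8 bit offset, 10 bit index, 18 bit tag). The function name is an assumption.

```python
def split_address(address, offset_bits=8, index_bits=10):
    """Split a processor address into (tag, index, offset)."""
    offset = address & ((1 << offset_bits) - 1)     # position inside the line
    line_number = address >> offset_bits            # 28 bit in the example
    index = line_number & ((1 << index_bits) - 1)   # LSBs of the line number: the slot
    tag = line_number >> index_bits                 # remaining MSBs, compared with the stored TAG
    return tag, index, offset


# Build a 36 bit address from known fields and split it again.
address = (0x2ABCD << 18) | (0x155 << 8) | 0x7F
print([hex(field) for field in split_address(address)])
```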
[Figure: directly mapped cache. The processor generated address is split into TAG, index (slot) and line offset; each cache slot stores TAG and DATA.]
In each slot only one line for each index can be stored.
then check)
[Figure: set-associative cache. The processor generated address is split into TAG, slot (index) and line offset; each way stores TAG and DATA; the outputs are the requested data word and hit/miss.]
Parallelism of the comparators identical to that of the directly mapped cache.
In directly mapped caches data can be provided before the validity and TAG check. In set-associative caches only after the check.
[Figure: set-associative cache. The ADDRESS is split into Tag, Index and Offset; a DECODER selects the set from the Index; each set holds Way 0, Way 1, …, Way n; each CACHE LINE stores Status, Tag and Data; the TAG COMPARATORs drive the data selection.]
A line can be stored:
in a directly mapped cache, in only one slot, that corresponding to the INDEX
in a set-associative cache, in any way of the slot corresponding to the INDEX
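The difference between the two placements can be sketched with a toy model that stores only validity bits and tags. The class and method names are assumptions; a directly mapped cache is the special case with one way.

```python
class SetAssociativeCache:
    """Toy tag store: `sets` slots, each with `ways` ways (data omitted)."""

    def __init__(self, sets, ways, line_bytes):
        self.sets, self.ways, self.line_bytes = sets, ways, line_bytes
        # One (valid, tag) pair per way of each set.
        self.lines = [[(False, 0)] * ways for _ in range(sets)]

    def _split(self, address):
        line_number = address // self.line_bytes
        return line_number // self.sets, line_number % self.sets   # (tag, index)

    def lookup(self, address):
        """Return the hit way, or None on a miss."""
        tag, index = self._split(address)
        for way, (valid, stored_tag) in enumerate(self.lines[index]):
            if valid and stored_tag == tag:   # hardware does these comparisons in parallel
                return way
        return None

    def fill(self, address, way):
        """Store the line containing `address` in the given way of its set."""
        tag, index = self._split(address)
        self.lines[index][way] = (True, tag)


cache = SetAssociativeCache(sets=4, ways=2, line_bytes=64)
cache.fill(0 * 64, way=0)   # line 0 -> set 0
cache.fill(4 * 64, way=1)   # line 4 -> also set 0, so it needs another way
print(cache.lookup(0), cache.lookup(4 * 64), cache.lookup(8 * 64))
```

With `ways=1` the second fill would have to overwrite the first line, which is the "very little flexibility" of the directly mapped organization.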
ult.htm
3.htm
e2.htm
multiport, which means that 2 or 3 addresses can be presented to the cache, which answers
for the DLX
Caches are of limited size and therefore it is necessary (i.e. in case of a read miss) to select a line which must be discarded (overwritten if not modified; written back to memory and then overwritten if modified).
There are basically three possible policies: RAND (Random), LRU (Least Recently Used) and FIFO (First In First Out), with different efficiency and complexity.
RAND: in this case the logical network must first detect whether invalid lines are present (and therefore overwrite one of them); otherwise a pseudo-random generator (i.e. a shift register with feedback through an EX-OR gate) must select the line to be replaced. The algorithm can be refined by selecting the non-modified lines first. Although non-optimal, this algorithm is very cost-effective.
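A minimal sketch of the RAND policy, assuming a 16 bit shift register with EX-OR feedback as the pseudo-random generator. The register width and tap positions are an assumption chosen for illustration (a commonly used maximal-length configuration), not something specified in the slides; the function names are hypothetical as well.

```python
def lfsr16_step(state):
    """Advance a 16 bit shift register whose input is the EX-OR of four taps."""
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def choose_victim(valid, state):
    """Return (way to replace, new generator state) for one set.

    Invalid lines are used first; only if all ways are valid does the
    pseudo-random generator pick the victim.
    """
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way, state
    state = lfsr16_step(state)
    return state % len(valid), state


print(choose_victim([True, False, True, True], 0xACE1))   # invalid way 1 is chosen
print(choose_victim([True, True, True, True], 0xACE1))    # generator picks a way
```

The state must be initialised to a non-zero value, otherwise the register stays at zero forever.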
LRU: shift register network. NB: the same network for each set.
When a "hit" occurs the hit way must become the most recent way and all others become of a lower rank, with no rank change among them.
Let's suppose there are 4 ways and that all lines of the set are valid. The way (its number) in position Ra is the most recently hit. The other lines were hit in the past according to their positions.
Na, Nb, Nc, Nd: the hit way numbers (0, 1, 2, 3 in any order according to the set history!).
Rx, Ra, Rb, Rc, Rd: 2 bit registers. Rx stores the present hit way number (if any, that is no miss).
[Figure: right shift register Rx -> Ra -> Rb -> Rc -> Rd. The clock (CLK) of each register is gated by AND gates driven by Ex-OR gates which compare Rx with the stored way numbers. The Ex-OR output is zero if its inputs are identical.]
Rd stores the way number least recently hit (it stores the oldest line). Its line is the candidate for replacement in case of a miss for the set. Ra stores the way number most recently hit. For each hit the contents of the 2 bit registers are right shifted one position.
Let's now suppose a HIT for way 2 and that the way numbers in the Ri registers from left to right are 1, 0, 2 and 3 (way 3 is the replacement candidate in case of set miss). The shift register right shifts until Rc (whose way number is 2) and not beyond, because the Rd clock is blocked by the Ex-OR. After the clock the Ri registers store (in sequence) 2, 1, 0, 3 (way 3 is still the candidate for replacement while all other way numbers are correctly updated, with way 2 as the most recently hit).
[Figure: present status Ra, Rb, Rc, Rd = 1, 0, 2, 3 with Rx = 2; next status Ra, Rb, Rc, Rd = 2, 1, 0, 3.]
When a line is invalidated its way number is stored in Rd and all other way numbers which were hit less recently than the invalidated line are left shifted one position. The mechanism is symmetrical to the hit mechanism. For instance, starting from the present status of the previous figure, line 0 (in register Rb) is invalidated. Line 0 is stored in Rd while line 2 is stored in Rb and line 3 in Rc. In order to deal with the invalidation a symmetrical circuit must be added (the network must shift left until the position of way 0 is reached, that is the clock is blocked at Rb).
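The behaviour of the shift register network can be modelled in software with an ordered list, where position 0 plays the role of Ra and the last position that of Rd. This is only a behavioural sketch with assumed function names, not a description of the hardware.

```python
def hit(order, way):
    """Hit: the way moves to Ra; only the more recent ways shift right."""
    position = order.index(way)
    return [way] + order[:position] + order[position + 1:]


def invalidate(order, way):
    """Invalidation: the way moves to Rd; only the less recent ways shift left."""
    position = order.index(way)
    return order[:position] + order[position + 1:] + [way]


present = [1, 0, 2, 3]            # Ra, Rb, Rc, Rd
print(hit(present, 2))            # [2, 1, 0, 3]: way 3 is still the replacement candidate
print(invalidate(present, 0))     # [1, 2, 3, 0]: way 0 becomes the replacement candidate
```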
COUNTERS: a counter for each way of each set.
The counter values correspond to the way ranking position for replacement: 0 -> most recently hit, 3 -> least recently hit.
In most implementations the counters can be incremented or reset. In case of a hit, the counters with a value lower than that of the hit way are incremented and the counter of the hit way is reset (value zero). In case of miss and replacement the way whose counter is 3 is selected and then the system behaves as if that way was hit. In case of invalidation the counter of the invalidated way becomes 3 and all other counters with a greater value (less recently hit) are decremented.
It must be noted that the counter algorithm is equivalent to the shift register network. There the position indicates the age rank, here the counter does.
Counters values

Events                                            Validity       Initial status   Final status
                                                  W0 W1 W2 W3    W0 W1 W2 W3      W0 W1 W2 W3
01) Hit Way 0                                     1  1  1  1     1  0  3  2       0  1  3  2
02) Miss (line fill, Way 2 count 3 replaced)      1  1  1  1     0  1  3  2       1  2  0  3
03) Way 1 invalidated                             1  1  1  1     1  2  0  3       1  3  0  2
04) Hit Way 0                                     1  0  1  1     1  3  0  2       0  3  1  2
05) Way 3 invalidated                             1  0  1  1     0  3  1  2       0  2  1  3
06) Miss (line fill, Way 3 count 3 replaced)      1  0  1  0     0  2  1  3       1  3  2  0
07) Hit Way 2                                     1  0  1  1     1  3  2  0       2  3  0  1
08) Miss (line fill, Way 1 count 3 replaced)      1  0  1  1     2  3  0  1       3  0  1  2
09) Miss (line fill, Way 0 count 3 replaced)      1  1  1  1     3  0  1  2       0  1  2  3
10) Miss (line fill, Way 3 count 3 replaced)      1  1  1  1     0  1  2  3       1  2  3  0

The validity column refers to the initial status of each event.
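The counter rules can be replayed in software to check a sequence of events like the one in the table. The sketch below uses assumed function names and a list of four counters, one per way; it models only the counters, not the validity bits.

```python
def hit(counters, way):
    """Hit: counters lower than the hit way's counter are incremented, the hit way is reset."""
    old = counters[way]
    return [0 if w == way else (c + 1 if c < old else c)
            for w, c in enumerate(counters)]


def invalidate(counters, way):
    """Invalidation: the way gets the highest count, greater counters are decremented."""
    old = counters[way]
    last = len(counters) - 1
    return [last if w == way else (c - 1 if c > old else c)
            for w, c in enumerate(counters)]


def miss(counters):
    """Miss: the way with the highest count is replaced, then treated as hit."""
    victim = counters.index(len(counters) - 1)
    return victim, hit(counters, victim)


state = [1, 0, 3, 2]                 # W0..W3, initial status of event 01
state = hit(state, 0)                # 01) hit way 0
victim_02, state = miss(state)       # 02) miss
state = invalidate(state, 1)         # 03) way 1 invalidated
print(victim_02, state)              # way 2 was replaced; counters are now [1, 3, 0, 2]
```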
PSEUDO-LRU (in this example 4 ways)
The 4 ways of the set are indicated by I0, I1, I2 and I3.
In case of miss an invalid line (if any) is replaced.
There are three bits (B0, B1 and B2) for each set.
If the last set access was for I0 or I1 then B0=1, otherwise B0=0.
If the last access for the two ways I0 and I1 was for I0 then B1=1, otherwise B1=0.
If the last access for the two ways I2 and I3 was for I2 then B2=1, otherwise B2=0.
In case of replacement, according to B0 the cache first selects which couple (I0:I1 or I2:I3) was least recently accessed, then selects within the couple the way to be replaced according to B1 or B2.
Decision tree: B0=0? Yes: (I0:I1) least recently accessed, then B1=0? Yes: replace I0, No: replace I1. No: (I2:I3) least recently accessed, then B2=0? Yes: replace I2, No: replace I3.
The algorithm is pseudo-optimal because I1 could be the way least recently accessed but could be masked by I0 if this is the most recently accessed.
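A minimal sketch of the three-bit pseudo-LRU rules for one 4-way set, with the bit meanings given above. Function names are assumptions; the validity check (invalid lines first) is left out to keep the example short.

```python
def touch(bits, way):
    """Update (B0, B1, B2) after an access to way I0..I3."""
    b0, b1, b2 = bits
    if way in (0, 1):
        return (1, 1 if way == 0 else 0, b2)   # last access in couple I0:I1
    return (0, b1, 1 if way == 2 else 0)       # last access in couple I2:I3


def victim(bits):
    """Way selected for replacement when all lines are valid."""
    b0, b1, b2 = bits
    if b0 == 0:                      # couple I0:I1 was least recently accessed
        return 0 if b1 == 0 else 1
    return 2 if b2 == 0 else 3       # couple I2:I3 was least recently accessed


bits = (0, 0, 0)
for way in (1, 2, 3, 0):             # I1 is the true least recently accessed way
    bits = touch(bits, way)
print(bits, victim(bits))            # I2 is chosen: I1 is masked by the recent access to I0
```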
Caches
FIFO
In this implementation there is a single modulo n counter for each set which, starting from 0, is incremented for each read miss (that is for each replacement). The new line is inserted in the way pointed to by the value of the counter.
This algorithm has a singularity because it does not consider the invalidations. If the counter has value 3 and the line in way 2 is invalidated, way 2 and not way 3 should be used in case of read miss. Although suboptimal, this algorithm has a very good cost/effectiveness ratio.
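The per-set counter can be sketched in a few lines. The class name is an assumption; as noted above, this simple form ignores invalidations.

```python
class FifoSet:
    """Replacement pointer of one set: a modulo-n counter, n = number of ways."""

    def __init__(self, ways):
        self.ways = ways
        self.counter = 0

    def replace(self):
        """Return the way to fill on a read miss and advance the counter."""
        way = self.counter
        self.counter = (self.counter + 1) % self.ways
        return way


fifo = FifoSet(ways=4)
print([fifo.replace() for _ in range(5)])   # [0, 1, 2, 3, 0]
```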
http://www.ecs.umass.edu/ece/koren/architecture/Cache/frame1.htm http://www.ecs.umass.edu/ece/koren/architecture/PReplace/
[Figure: Processor, TLB, Cache and RAM. Signals shown: virtual address MSBs, physical address MSBs, hit, miss, data.]
physical memory addresses. It is addressed by processor virtual
the physical MSBits addresses (the cache line number). LSBits (the
modern processors the TLB (like the caches) has two levels.
the TLB does not exist since the virtual addresses are also the physical addresses
normally 8-16 ways set associative 64-1024 slots
The retrieved data is inserted in the TLB, replacing one slot if all slots are used (no write back required!). In case of cache miss the usual procedure
How are lines dealt with in case of write miss? Read (with possible replacement) and then write?
Two possible policies: Yes: write-allocate. No: no-write-allocate.
In case of write-allocate the operation is a line read (after a possible replacement) followed by a write of the line in cache. In the other case data are written in the following cache levels (if any, and if they contain the line), otherwise in memory.
N.B.: Write operations are MUCH less frequent than reads.
What happens when a write hit occurs? Must data be written also in the following cache levels?
Two policies: Yes: write-through. No: write-back.
In the first case the line is overwritten and data are written in the following cache levels too (down to the memory).
In the second case a line is overwritten without forwarding the data to the next level cache (or memory, except for coherency problems, see later). When a line must be replaced, an already overwritten line must first be written back to the following cache level (which could be the memory) since the data in the first level are more recent. The data traffic is much smaller (smaller bandwidth use) but the hardware is more complex.
The write-back policy implies that a bit for each line must be present in order to indicate whether the line has been modified (dirty bit).
It must be underlined that a line is a consistent data structure and therefore, even in case of a single byte modification, the entire line must be written back.
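The two write-hit policies can be contrasted with a toy model in which the next level is a dictionary of lines. All names here (`Line`, `write_hit`, `evict`) are illustrative assumptions.

```python
class Line:
    """A cached copy of one line, with its dirty bit."""

    def __init__(self, data):
        self.data = bytearray(data)   # private copy of the line
        self.dirty = False


def write_hit(line, offset, value, next_level, line_number, write_through):
    """Write one byte on a hit, following the selected policy."""
    line.data[offset] = value
    if write_through:
        next_level[line_number][offset] = value   # forwarded downstream at once
    else:
        line.dirty = True                         # write-back: only mark the line


def evict(line, next_level, line_number):
    """On replacement a dirty line is written back as a whole."""
    if line.dirty:
        next_level[line_number] = bytearray(line.data)
        line.dirty = False


memory = {5: bytearray(4)}
wb = Line(memory[5])
write_hit(wb, 1, 0xAA, memory, 5, write_through=False)
print(memory[5][1], wb.dirty)    # memory is still stale, the line is dirty
evict(wb, memory, 5)
print(memory[5][1])              # the whole line has been written back
```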
Very often, in order to reduce the impact of the access time, the posted-write methodology is used.
[Figure: Processor, Cache, FIFO, RAM.]
check whether the requested data are in the FIFO
The buffer (FIFO) is accessed by the processor (or by the cache in case of write back for replacement) with no delay. The memory controller then transfers the data from the buffer to the memory at the memory speed (much lower).
(or cache) is delayed.
24
caches Caches
«agent» (processor, DMA, graphic processor..) upon a read request (please notice that a write of a non present line is preceded by the reading of the line).
belonging to a multiprocessor system but also between different levels caches of the same processor
multiprocessor system have two levels caches (L1 and L2) and the common
to the processor). Let's suppose the caches are inclusive, that is if a line is present in L1 it is also present in L2 (but not vice versa).
[Figure: two processors P1 and P2, each with Cache 1 and Cache 2, connected through a BUS to the Memory.]
How can we guarantee that an external agent (not the processor) reads from memory the most recent version of data (the data in memory could be stale, that is «old»)? Let's consider the write policies.
Write-through: For each processor data write (whether the data is present in cache or not) the data is also written in memory: coherency is therefore guaranteed but the system is slowed by the memory access time.
Posted write-through: Similar to the previous case. The processor efficiency is improved (the processor is not normally delayed by the memory access time). No access is allowed to the external agent until data are written to the memory (not easy to implement and not very efficient).
Write-back: In this case the memory is updated only when necessary (i.e. a replacement). For each external agent access the cache (or the caches) must be checked in order to verify whether it (they) stores the requested data; if the answer is positive, the agent memory access must be blocked until the requested data are forcibly written back to the memory. This is the cache snoop mechanism.
What happens when another agent wants to write data in memory (or in its caches)?
Write-through: The cache controller must monitor the system bus and invalidate the lines (if any) containing the data overwritten by the agent (until then coherent).
Write-back: The cache controller must monitor the system bus and, in case of an agent attempt to write, must perform the following operations:
a) If data are present in cache in a modified line (or lines), the controller must stop the agent, trigger a write-back of the modified line (lines) and then invalidate it. It must be noticed that the write-back operation is needed because, since a line is made of several bytes, there is no way of detecting which byte (or bytes) were modified. The new master could write bytes different from those which were modified.
b) If the line data are not in modified state (the line is coherent with the memory), upon a write from another master the line must be invalidated.
N.B. The processor has no way of determining whether a secondary cache is present. The signals exchanged with the system must be the same whether a secondary cache exists or not. The same applies to the secondary cache if a third level cache is present.
How can we guarantee that another agent reads from memory the most updated data (if those data were also in a cache, the corresponding data in memory could be «stale», that is «older» than those in cache)?
L1 and L2 write-through: For each processor write (whether the data is present in cache or absent) data are written down to the memory. This obviously has a great impact on the bus, the most important bottleneck. The write operation by L2 could be deferred. In case of a write by another agent, data are invalidated (if present) in both cache levels.
L1 write-through and L2 write-back: In this case L2 must monitor the bus and, when another agent tries to read data, must first write back to memory the modified data (if any; data are in any case the same in L2 and in L1, if present in L1). In case of a write access by another agent, modified data must first be written back to memory and then invalidated in both caches.
L1 and L2 both write-back
When the processor reads data (line fill) upon a miss in L1, L2 checks whether it stores the requested data. If yes, data are transferred to L1 (with a possible replacement). If the data are present in L2 this means that they are «cacheable». If data are not available in L2, data are requested from the memory controller (MC). If data are «cacheable» a line fill takes place both in L2 and L1.
In case of a processor write operation with both L1 and L2 write-back there are many cases, which depend on whether the system is mono- or multi-processor: in any case the system must provide the most updated data when they are requested.
How are these policies implemented? MESI PROTOCOL
M – modified (L1 and L2): The requested line is available in cache, where it was modified without write-back downstream (which is L2 for L1 and memory for L2). The considered cache stores updated data. Notice that if a line is in modified state in both L1 and L2, the line in L1 is more updated than the same line in L2.
E – exclusive (L1 and L2): The considered line is present and identical to the same line present in the device downstream (which is L2 for L1 and memory for L2). A write changes the state to M with no downstream write. (Careful: the name can be misleading; in case of a multiprocessor it means that the data are present in only one processor.)
S – shared (state possible only for L1 in a monoprocessor system): The line is present in L1 (S), L2 (E) and memory. A write operation triggers a downstream write, upon which the L1 state becomes E and the L2 state is changed from E to M (no memory write, see state E). L2 in monoprocessor systems is never in shared state because there are no agents which need to be informed of the state of the (single) processor internal line (which is not the case in multiprocessor systems).
I – invalid (L1 and L2): The requested line is not available in cache.
N.B. Lines of a code cache can be only in S or I state. At the system start-up all lines in all caches are invalid.
Monoprocessor case (with two levels caches)

L2 state    Possible L1 states
I           I
M           E, M, Not present
E           S, Not present

NB: Not present: the line can be absent from L1 only, because we consider inclusive caches (and L2 > L1). L2 is never shared in monoprocessor systems!!! L1 is always in a state which is related to the state of L2.
In case of monoprocessor systems a line fill when data are not present both in L1 and L2 triggers the following state change: L1 state becomes S and L2 state E. A subsequent write operation to L1 triggers a state change of L1 to E and of L2 to M (the data written in L1 are also written to L2 because the L1 state is shared). Data are not written to memory. A further write operation affects only L1, whose state becomes M.
NB: Since the size of L2 is bigger than the size of L1 it is possible (because of replacements) that a line is not present in L1 but only in L2, either in E or in M state. A line fill in L1 therefore stores the line in L1 in S or E state respectively.
In the following slides we assume that all caches are inclusive. The MESI protocol is however applicable also to other cases.
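The processor-side transitions of the monoprocessor, two-level, inclusive case can be written as a small table-like sketch. States are the letters M, E, S, I, with I also standing for "not present"; the function names are assumptions, and only the transitions stated in the text are modelled.

```python
def processor_read(l1, l2):
    """Return the new (L1, L2) states after a processor read."""
    if l1 != "I":
        return l1, l2                 # hit in L1: nothing changes
    if l2 == "I":
        return "S", "E"               # miss in both: double line fill
    # Miss in L1 only: line fill from L2, L1 state depends on L2 state.
    return ("S", "E") if l2 == "E" else ("E", "M")


def processor_write_hit(l1, l2):
    """Return the new (L1, L2) states after a processor write that hits in L1."""
    if l1 == "S":
        return "E", "M"               # written to L1 and L2, not to memory
    return "M", l2                    # L1 in E or M: only L1 is written (L2 is already M)


state = processor_read("I", "I")
print(state)                          # ('S', 'E')
state = processor_write_hit(*state)
print(state)                          # ('E', 'M')
state = processor_write_hit(*state)
print(state)                          # ('M', 'M')
```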
The following cases apply to external agents without private cache (i.e. DMA controller, graphic processor etc.) accessing memory.
Read operation. If the L2 line containing the requested data is in E-state then the same line is in S-state in L1 (if present). Memory data are therefore the most updated. The status of the cached data is not changed. If the line is modified in L2, then a check must be made in L1. If the line is not present in L1 or is in E state, then the data in L2 must be written back to the memory, L1 becomes S (if present) and L2 becomes E. If the data in L1 are modified, then they must be written back to L2 and memory. L1 becomes S and L2 E.
Write operation. It triggers an "enquiry" of the Memory Controller in L2. If the line containing the data is present in L2 in E-state then the same line is in S-state in L1 (if present). The line is invalidated in L1 and L2. If the line in L2 is in M-state, L2 must check whether the line is present in L1 and is in M-state. In any case the most recently updated version of the line is written back to the memory and the line is invalidated in both caches. The external agent can then write its data in memory.
Monoprocessor, processor read
1) Miss in L1 and not in L2. Line fill from L2 to L1. The L1 state depends on the L2 state. If the L2 state is exclusive, L1 becomes shared; if the L2 state is modified, L1 becomes exclusive. No chance of a line present in L1 and not in L2.
2) Miss in L1 and L2 -> double line fill. L1 -> shared and L2 -> exclusive.
N.B. Why must L1 -> S if L2 is exclusive? Because, in case of a write, if L1 were in exclusive state no write-back to L2 would take place (L1 E -> M), and a memory enquiry would find that the requested data in L2 are identical to those in memory (although stale) and no further enquiry on L1 would take place. A read or write of an external agent would operate on the memory data without write-back of the L1 data (the most recent data).
Monoprocessor, processor write (hit in L1)
a) L1 shared (and therefore L2 exclusive). Write to L1 and L2. L2->M and L1->E.
b) L1 exclusive (and therefore necessarily L2 modified). Write to L1 only. L1->M (and L2 M).
c) L1 modified (and therefore L2 modified): write to L1 only. L1 remains in M-state (as L2).
>S) then write to both caches (L1-> E and L2->M)
L2 in E state then L1->S, otherwise (L2 in M state) L1->E (L2 can
External agent READ
1) Miss in L1 and L2, or HIT in L1 or L2, both not modified: NOP
2) Hit in L1 modified (and therefore L2 modified): L1 write back to memory and L2. L1->S and L2->E
3) Hit in L2 modified (and L1 exclusive or line not present in L1): L2 write back to memory. L2->E and L1 (if present) ->S
External agent WRITE
1) Miss in L1 and L2: NOP
2) Hit in L2 and possibly in L1, both not modified: L2->I and L1 (if present) ->I
3) Hit both in L2 and L1 (both modified): write back to memory
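The snoop cases for an external agent without private cache can be sketched as two functions returning the new (L1, L2) states and the cache that must write the line back to memory, if any. The names are assumptions; the invalidation after a write-back on an external write follows the rule stated earlier in these slides (the most recent version is written back and the line is invalidated in both caches).

```python
def external_read(l1, l2):
    """Return (new L1, new L2, cache that writes back or None)."""
    if l2 != "M":
        return l1, l2, None                    # memory is up to date: NOP
    if l1 == "M":
        return "S", "E", "L1"                  # L1 holds the most recent data
    return ("S" if l1 != "I" else "I"), "E", "L2"


def external_write(l1, l2):
    """Return (new L1, new L2, cache that writes back or None)."""
    if l2 == "I":
        return "I", "I", None                  # miss in both caches: NOP
    source = None
    if l2 == "M":
        source = "L1" if l1 == "M" else "L2"   # most recent version goes to memory first
    return "I", "I", source                    # then the line is invalidated everywhere


print(external_read("M", "M"))    # ('S', 'E', 'L1')
print(external_write("S", "E"))   # ('I', 'I', None)
```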
M – modified: The line is present only in the caches of one processor and in the specified cache it was modified without being written back to the downstream device (it is different from the same line in the downstream device). The line can be read and written without any downstream cycle.
E – exclusive: The line is present only in the caches of one processor and its content is identical to the downstream device. The line can be read and written without any downstream cycle. A processor write operation causes a transition to M state.
S – shared: As before, but now L2 too can be in S state. The line is in fact possibly in the caches of many processors. (Possibly, because it could have been present, for instance, in two processors and then one of them has replaced the line.) A write operation causes a downstream write and invalidates the same line in the caches of the other processors, if any.
I – invalid: The requested line is not available in cache. A read operation causes a LINE-FILL. A write operation causes a WRITE-THROUGH in case
Multiprocessor case (with two levels caches)

L2 state    Possible L1 states
I           I
M           E, M, Not present
E           S, Not present
S           S, Not present

In case of multilevel caches a lower level cache stores a reduced set of the lines of the upper level (inclusive caches). But not always (not inclusive caches)!
1) Miss in L1 but not in L2.
NB: Similar to the monoprocessor case, but notice that in this case it is possible that both L1 and L2 are in shared state (while in case
Caches
2) Miss in L1 and L2 . Bus snoop
not present in caches of another processor in L1 the read line is in shared state and in L2 is in exclusive state
(that is is in shared or exclusive state) upon the snoop all become shared state, The line is read into L1 and L2: in both caches of the requesting processor (as in all caches of the other processors) the state become shared
is in modified state (a line can be in modified state ONLY in
in memory, the hit caches state becomes shared. The line is read into L1 and L2: in both caches the state become
in modified state in the corresponding L2 too !! N.B. A bus snoop is a snoop on L2 which is forwarded to L1 if L2 is in modified state
Caches
Multiprocessor
1) Miss in L1 and L2. Three cases:
a) The line is not in the caches of other processors: as for the monoprocessor.
b) The line is present in other caches, not modified: all caches containing the line are invalidated. Read in L1 and L2 and then write; final state L1 exclusive and L2 modified.
c) The line is present in another processor (only one!) in modified state: the line is written back to memory and the caches storing the line are invalidated (do not forget that both L1 and L2 can be in modified state). The modified line must first be written back because it is not known which data of the line will be rewritten. Then, as in the previous case, the line is read and written; final state L1 exclusive and L2 modified.
2) Miss in L1 and not in L2. The line stored in L2 is forwarded to L1 and then written. Three cases:
a) L2 exclusive. No bus snoop. The line is written in L2 and L1. At the end L2 modified and L1 exclusive.
b) L2 modified. At the end L2 modified and L1 exclusive.
c) L2 shared. Bus snoop with invalidation, read in L1 and L2 and then write operation. L1 exclusive and L2 modified.
3) Hit in L1 (and therefore in L2). Three cases:
a) L1 modified. Only L1 is written.
b) L1 exclusive. Only L1 is written. L1 modified.
c) L1 shared. Two cases:
I. L2 is shared. Bus snoop with invalidation, then write on L1 and L2. L1 exclusive and L2 modified.
II. L2 exclusive. No bus snoop. Write on L1 and L2. L1 exclusive and L2 modified.
N.B. There are no cases with L1 shared and L2 modified,
[Figure: example sequence of the MESI states of one line in a three-level inclusive cache (levels 1, 2, 3), starting from I I I, for the events: read non shared line, write line, write line, read from another processor, read shared line, write line (broadcast), new read, replacement in the 1st level, write from another processor (broadcast).]
Only entire lines are transferred.
Cache is inclusive: higher levels have a superset of the lines of the lower levels. (I) means that the line no longer exists at that level (after a replacement) while it is still stored in the other levels, where it remains in its previous state.
Shared: one or more processor caches have the line, coherent with the memory.
Non cached: no processor cache has the line.
Modified: only one processor cache has the line; in this case the processor is the temporary owner of the line.
[Figure: four nodes P1 to P4, each with its caches C1 to C4 (possibly multilevel), I/O, a directory D and a local memory M.]
Directories “Directory based” coherency protocol Local Memories (dual port memories)
(accessible also from other processors). There is therefore a unique memory addressing system for all memories (true in all modern multiprocessors). Local memories are normally dual-port memories
memory and the processors whose caches (if any) store the line
cache stores the line. Two or more 1’s mean that the line is in shared state. A single 1 means that there is a possible owner of the line (the line could be in modified state). If a processor modifies a line in a cache a message is sent to the directory which it belongs to, which in turn sends a message to invalidate the same line in the
In case of a read a message is sent to the owner (if any), which must write back its modified data, and the line becomes shared. In case of a write a write-back takes place, followed by the invalidation of the line of the previous owner (if any).
[Figure: four nodes, each with processor P, cache C, I/O, directory D and memory M; a line L is shown in the Home, Remote and Local nodes.]
the processors by reducing the global use of the busses .
Caches 45
Not-unified means that data and instructions are not
are not-unified (Harvard architecture). Other levels are unified
In order to avoid stalls derived from branches, a branch prediction is necessary in the first stage of the pipeline. The prediction can be either correct or wrong. In any case the branch is tested in the execution stage. In case of a miss a line fill occurs and a replacement procedure is activated. The initial prediction is the outcome observed in the execution stage.
[Figure: Branch Target Buffer addressed by the PC; each entry stores PC address, destination address and a T/U (Taken/Untaken) field.]
The BTB is therefore a cache whose TAGs are the addresses (PC values) of the instructions detected as branches. The line in this case is the branch destination address and among the status bits there are those which predict whether the branch is Taken or Untaken.
How is a prediction managed ? On a statistical basis ?
Caches
backward)
(see loops) and the prediction is untaken for forward branches
forward and the static prediction is taken, and therefore the prediction gives better results
«taken» The error probability with this policy, according to SPEC benchmarks, is 34% (fairly high)
With only one prediction bit, which records the last verified branch outcome. In this case for loop1 there are two successive prediction errors.
[Figure: two loops, Loop1 and Loop2.]
When loop2 ends (predicted as taken but untaken) there is a following error, because in the first following loop loop1 will be predicted as untaken.
Caches
Normally two bits are used. Two possible schemes:
[Figure: first state diagram, four states with TAKEN/UNTAKEN transitions.]
In this case after two «mispredictions» the prediction is changed (low pass filter).
[Figure: second state diagram, four states with TAKEN/UNTAKEN transitions.]
In this case after two «mispredictions» the prediction is changed, but it is ready to go back to the previous prediction in case of a further change.
With both schemes the accuracy is higher than 80%.
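A two-bit scheme of the second kind can be sketched as a saturating counter. This is an assumption about how the state diagram maps to code (states 0 and 1 predict untaken, 2 and 3 predict taken); the original diagrams are not recoverable from the text, and the function names are hypothetical.

```python
def predict(counter):
    """Counter values 2 and 3 predict taken, 0 and 1 predict untaken."""
    return counter >= 2


def update(counter, taken):
    """Move one step towards the verified outcome, saturating at 0 and 3."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


counter = 3                          # strongly taken
counter = update(counter, False)     # first misprediction: still predicts taken
print(predict(counter))
counter = update(counter, False)     # second misprediction: prediction changes
print(predict(counter))
counter = update(counter, True)      # a further change brings the old prediction back
print(predict(counter))
```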
Two levels adaptive prediction
Two structures: BHR (Branch History Register) and PHT (Pattern History Table)
First case: global approach
BHR (shift register): history of the most recent n (8 in this example) branches (what really happened, that is whether the branch was verified as either taken or untaken). 1 -> branch taken, 0 -> branch not taken. Ex: content = 2Eh = 46 decimal.
PHT: 8 BHR bits => 2^8 (256 decimal) PHT slots, addressed from 00h to FFh (the example BHR content selects slot 2E). Each slot answers the question: what was predicted with the same global succession (BHR)?
Decision: taken or untaken.
In case of a branch the most recent event succession is analysed (whether the branches were really taken or untaken). For each configuration of this succession a pattern is selected which reflects the decisions taken with this succession configuration. After each branch execution the resulting value is stored in the right-shifted BHR.
A function must be defined which, according to the contents of the BHR and of the selected PHT slot, provides the prediction (see next slide).
This prediction system (which uses n + (2^n x m) FFs, where n is the size of the BHR and m that of each PHT slot) is not particularly significant because there is no difference among all branches. Effective but not very precise.
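A minimal sketch of the global approach, under stated assumptions: each PHT slot is modelled as an m bit shift register of the outcomes observed with that BHR content, and the decision function is a simple majority of ones in the selected slot. The slides leave the function to the designer, so this choice, like the class and method names, is only illustrative.

```python
class GlobalPredictor:
    """One n bit BHR shared by all branches, indexing a PHT of 2**n slots of m bits."""

    def __init__(self, n=8, m=4):
        self.n, self.m = n, m
        self.bhr = 0
        self.pht = [0] * (1 << n)

    def predict(self):
        """Assumed decision function: taken if most bits of the selected slot are 1."""
        ones = bin(self.pht[self.bhr]).count("1")
        return 2 * ones > self.m

    def update(self, taken):
        """Record the verified outcome in the selected PHT slot and in the BHR."""
        bit = 1 if taken else 0
        slot_mask = (1 << self.m) - 1
        self.pht[self.bhr] = ((self.pht[self.bhr] << 1) | bit) & slot_mask
        self.bhr = (self.bhr >> 1) | (bit << (self.n - 1))   # right-shifted BHR


predictor = GlobalPredictor()
print(predictor.predict())          # nothing recorded yet: untaken
for _ in range(50):
    predictor.update(True)          # a branch that is always taken
print(hex(predictor.bhr), predictor.predict())
```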
Caches
Caches 52
bits was chosen. There is no correlation between the two sizes, which can be individually and arbitrarily chosen by the designer.
and is where the efficiency of the BTB relies.
a. The number of «ones» in the BHT and PHT is counted: if it is greater than 7 then the next prediction is «taken» (1), otherwise it is «untaken» (0). b. The first 5 bits of the BHT and PHT are ex-ored and the number
c. ……….
Second case: mixed predictor. K BHT shift registers (one for each analysed branch) pointing to the same PHT.
In this case there is a BHR for each branch considered (K in this case), that is a shift register of n bits for each one of the K branches considered, while there is only one PHT.
[Figure: BHT with K slots of n bit, selected by the branch (address), one slot for each of the K branches considered; PHT with 2**n slots of m bit.]
Two BHT slots can point to the same PHT slot.
N.B.: registers related to different branches can point to the same PHT register. In this case too there is a lack of consistency: while the history
How many FFs? Used FFs: k x n + (2^n x m)
Third case: homogeneous predictor. One different PHT for each BHT. A (complex) refinement of the second case.
[Figure: BHT with k slots of n bit, selected by the branch (address); k PHTs, each with 2**n slots of m bit.]
How many FFs? Required FFs: k x n + (2^n x m x k)
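The flip-flop counts of the three predictor organizations can be compared numerically. The function names and the sample values of k, n and m below are arbitrary assumptions chosen only for illustration; the formulas are the ones given in the slides.

```python
def ff_global(n, m):
    """One n bit BHR and one PHT of 2**n slots of m bits."""
    return n + (2 ** n) * m


def ff_mixed(k, n, m):
    """k history registers of n bits sharing one PHT."""
    return k * n + (2 ** n) * m


def ff_homogeneous(k, n, m):
    """k history registers of n bits, each with its own PHT."""
    return k * n + (2 ** n) * m * k


k, n, m = 16, 8, 2
print(ff_global(n, m), ff_mixed(k, n, m), ff_homogeneous(k, n, m))   # 520 640 8320
```

The cost of the third case grows with k times the PHT size, which is why it is described as a complex refinement.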