Chapter 5 - Shared Memory Multiprocessors: Logical design and software interactions
Adapted from the publisher's slides by Mario Côrtes, IC/Unicamp
Shared Memory Multiprocessors
Symmetric Multiprocessors (SMPs)
– Symmetric access to all of main
pag 269
– can be very high performance since no OS involvement is necessary; control
[Figure: layers of abstraction. Shared address space and message passing sit on top of the communication abstraction (user/system boundary); below it come compilation, operating systems support, communication hardware (hardware/software boundary) and the physical communication medium.]
pag 270
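[Figure: four shared-memory organizations for processors P1..Pn.
(a) Shared cache: processors connect through a switch to a shared first-level $ and (interleaved) main memory.
(b) Bus-based shared memory: each processor has its own $; caches share a bus with Mem and I/O devices.
(c) Dancehall: processors with caches on one side of an interconnection network, memories on the other.
(d) Distributed-memory: each processor has a $ and a local Mem, connected by an interconnection network.]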
pag 270-271
– They’ll keep accessing stale value in their caches
pag 272
pag 272
– Coherence and Consistency – Snooping Cache Coherence Protocols
– (a) uncacheable memory (mark the memory segment reserved for I/O),
pag 273-274
– Processes accessing main memory may see a very stale value
– the value in memory depends on when the block is evicted and written back
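[Figure: P1, P2 and P3, each with a cache ($), share a bus with memory and I/O devices; memory initially holds u:5]
1. P1 reads Mem(u) into $1 (u:5)
2. P3 reads Mem(u) into $3 (u:5)
3. P3 writes 7 to $3(u) and Mem(u) (write-through); u = 7
4. P1 reads u from $1 (still 5??)
5. P2 reads Mem(u) into $2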
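The five steps above can be replayed with a toy model. This is only a minimal sketch, assuming private write-through caches with no coherence mechanism at all; the class name `WriteThroughCache` and the dictionary standing in for main memory are hypothetical and not part of the slides.

```python
class WriteThroughCache:
    """Private cache with write-through and NO coherence actions (toy model)."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for main memory
        self.lines = {}        # addr -> locally cached value

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: may return a stale value

    def write(self, addr, value):
        self.lines[addr] = value              # update own copy
        self.memory[addr] = value             # write through to memory


memory = {"u": 5}
c1, c2, c3 = (WriteThroughCache(memory) for _ in range(3))

r1 = c1.read("u")    # 1. P1 reads u from memory
r2 = c3.read("u")    # 2. P3 reads u from memory
c3.write("u", 7)     # 3. P3 writes 7 (cache and memory)
r4 = c1.read("u")    # 4. P1 hits in its own cache
r5 = c2.read("u")    # 5. P2 misses and reads memory
print(r1, r2, r4, r5, memory["u"])
```

In this model P1 keeps reading its old copy in step 4, while P2, which had no copy, sees the new value: the two processors disagree about u.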
pag 273-274
pag 275
pag 275-6
pag 276
– if I see w1 after w2, you should not see w2 before w1
– no need for analogous read serialization since reads not visible to
pag 277
– RD: followed by the data transfer
– WR: depends (is the data sent together with the address, or afterwards?)
pag 279
pag 277
– no new states or bus transactions in this case
– invalidation- versus update-based protocols
– inval causes miss on later access, and memory up-to-date via write-through
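[Figure: processors P1..Pn, each with a cache ($) attached to a shared bus with Mem and I/O devices; every cache snoops the bus (bus snoop) while cache-memory transactions go by]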
– state of a block can be seen as p-vector (p = number of caches)
– other blocks can be seen as being in invalid (not-present) state in that cache
– can have multiple simultaneous readers of block, but write invalidates them
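[State diagram: write-through invalidate protocol with states V (valid) and I (invalid); arcs are labeled observed event / generated transaction]
Processor-initiated transitions:
– I: PrRd/BusRd -> V
– I: PrWr/BusWr -> I
– V: PrRd/- -> V
– V: PrWr/BusWr -> V
Bus-snooper-initiated transitions:
– V: BusWr/- -> I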
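The two-state diagram can be turned into a small simulation. The sketch below is an illustration under stated assumptions, not the slides' own code: the `Bus` and `VICache` classes are hypothetical, writes are write-through with no allocation on a write miss, and a block being present in the `data` dictionary stands for state V.

```python
class Bus:
    """Shared bus: serializes transactions and lets every cache snoop writes."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def bus_rd(self, addr):
        return self.memory[addr]

    def bus_wr(self, source, addr, value):
        self.memory[addr] = value              # write-through updates memory
        for cache in self.caches:
            if cache is not source:
                cache.snoop_bus_wr(addr)       # other caches observe BusWr


class VICache:
    def __init__(self, bus):
        self.bus = bus
        self.data = {}                         # present = V, absent = I
        bus.caches.append(self)

    def state(self, addr):
        return "V" if addr in self.data else "I"

    def pr_rd(self, addr):
        if addr not in self.data:              # I: PrRd/BusRd -> V
            self.data[addr] = self.bus.bus_rd(addr)
        return self.data[addr]                 # V: PrRd/- (hit)

    def pr_wr(self, addr, value):
        if addr in self.data:                  # V: PrWr/BusWr, stays V
            self.data[addr] = value
        self.bus.bus_wr(self, addr, value)     # I: PrWr/BusWr, stays I

    def snoop_bus_wr(self, addr):
        self.data.pop(addr, None)              # V: BusWr/- -> I


bus = Bus({"u": 5})
c1, c2, c3 = VICache(bus), VICache(bus), VICache(bus)
r1 = c1.pr_rd("u")
r2 = c3.pr_rd("u")
c3.pr_wr("u", 7)                               # invalidates the copy in c1
state_after_write = c1.state("u")
r4 = c1.pr_rd("u")                             # miss, fetches the new value
r5 = c2.pr_rd("u")
print(r1, r2, state_after_write, r4, r5)
```

With snooping in place, the write by the third cache invalidates the first cache's copy, so the later read misses and picks up the new value from memory.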
pag 280
pag 281
– most recent write by this processor, or
– most recent read miss by this processor
pag 282
are issued by the same processor and M2 follows M1 in program order.
W.
for the write follows that for M.
not already separated from the write by another bus xaction.
(there may be bus xactions for read misses, as long as they follow local order)
– any order among reads between writes is fine, as long as in program order
[Figure: sequences of reads (R) and writes (W) issued by P0, P1 and P2; the writes split each processor's reads into intervals]
pag 282-3
– Each processor generates 30M stores/sec (200E6 cycles/s * 0.15)
– 1GB/s bus can support only about 4 processors without saturating
– Write-through especially unpopular for SMPs
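The arithmetic behind these bullets can be spelled out. The sketch below assumes a 200 MHz processor retiring one instruction per cycle and 8-byte stores; the CPI and the store size are assumptions made here for illustration (only the 200E6, 15% and 1 GB/s figures appear in the text above).

```python
# Integer arithmetic keeps the numbers exact.
clock_hz = 200_000_000        # 200E6 cycles per second
instr_per_cycle = 1           # assumption: CPI of 1
store_percent = 15            # 15% of instructions are stores
store_bytes = 8               # assumption: each store writes 8 bytes
bus_bytes_per_s = 1_000_000_000   # 1 GB/s bus

stores_per_s = clock_hz * instr_per_cycle * store_percent // 100
write_traffic = stores_per_s * store_bytes     # bytes/s per processor
max_processors = bus_bytes_per_s // write_traffic

print(stores_per_s, write_traffic, max_processors)
```

Under these assumptions each processor pushes 240 MB/s of write-through traffic, so a 1 GB/s bus saturates at about four processors even before counting read misses.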
pag 282-3
– to preserve orders among accesses to same location by different processes
– Typically use event synchronization, by using more than one location
/* Assume initial value of A and flag is 0 */
P1: A = 1;                 P2: while (flag == 0); /* spin idly */
    flag = 1;                  print A;
pag 283-4
– across different locations as well
– so programmers can reason about what results are possible
pag 284
pag 285
[Figure: processors P1, P2, ..., Pn issue memory references in program order through a single switch to Memory; the “switch” is randomly set after each memory reference]
pag 286
pag 286-7
– possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)
– we know 1a->1b and 2a->2b by program order
– A = 0 implies 2b->1a, which implies 2a->1b (2a, 2b, 1a, 1b)
– B = 2 implies 1b->2a, which leads to a contradiction (1a, 1b, 2a, 2b)
– BUT, actual execution 1b->1a->2b->2a is SC, despite not program order
– actual execution 1b->2a->2b->1a is not SC (since it would produce (A,B) = (0,2))
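The outcome analysis above can be checked by brute force. The sketch below assumes the usual form of this example, which is not spelled out in the surviving text: P1 executes (1a) A = 1; (1b) B = 2, and P2 executes (2a) print B; (2b) print A, with A and B initially 0. The helper names `run` and `respects_program_order` are hypothetical.

```python
from itertools import permutations

OPS = ["1a", "1b", "2a", "2b"]


def run(order):
    """Execute the four operations in the given global order; return printed (A, B)."""
    mem = {"A": 0, "B": 0}
    printed = {}
    for op in order:
        if op == "1a":
            mem["A"] = 1
        elif op == "1b":
            mem["B"] = 2
        elif op == "2a":
            printed["B"] = mem["B"]
        elif op == "2b":
            printed["A"] = mem["A"]
    return (printed["A"], printed["B"])


def respects_program_order(order):
    return (order.index("1a") < order.index("1b")
            and order.index("2a") < order.index("2b"))


# Every interleaving that keeps each processor's program order.
sc_outcomes = {run(o) for o in permutations(OPS) if respects_program_order(o)}
# The reordered execution singled out in the last bullet.
non_sc_result = run(("1b", "2a", "2b", "1a"))

print(sorted(sc_outcomes), non_sc_result)
```

Enumerating the six legal interleavings gives exactly the three outcomes listed as possible, and the reordered execution 1b->2a->2b->1a produces the forbidden (0,2).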
pag 287
– memory operations issued by a process must appear to become visible (to
– in the overall total order, one memory operation should appear to complete
– needed to guarantee that total order is consistent across processes
– tricky part is making writes atomic
pag 288
pag 288
pag 288
pag 289
pag 290
– since it caused a bus transaction
pag 291
pag 291
– makes the write visible, i.e. write is performed
– may be actually observed (by a read miss) only later
– write hit made visible (performed) when block updated in writer’s cache
– note: replaced block that is not in modified state can be dropped
pag 292
– In invalidation protocols, they would miss and cause more transactions
– Also, only the word written is transferred, not whole block
– In invalidation, first write gets exclusive ownership, others local
pag 292
– and clears out copies that won’t be used again
– single bus transaction to update all copies
pag 293
Bus Transactions
cache or other than M); a cache or memory supplies the data; all are invalidated
block in “M”); does not affect the processor (cache to memory only)
Actions
pag 293
[State diagram: MSI invalidation protocol with states M (modified), S (shared), I (invalid); arcs are labeled observed event / generated transaction]
– M: PrRd/- and PrWr/- stay in M; BusRd/Flush -> S; BusRdX/Flush -> I
– S: PrRd/- and BusRd/- stay in S; PrWr/BusRdX -> M; BusRdX/- -> I
– I: PrRd/BusRd -> S; PrWr/BusRdX -> M
Notes:
– on BusRd: if another cache holds the data in S, it does nothing (memory supplies the data); if it is in state M, that cache supplies the data (flush) and goes M -> S; both the requesting cache and memory pick up the data
– whole and modifies the word in question; RdX; all
– M
– the data returned by the RdX can be ignored because it is already in the cache; a simplification would be to use a new transaction, Bus Upgrade (BusUpgr), which also obtains exclusivity but does not cause anyone to supply data
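The MSI transitions can be written as a lookup table and exercised on a short access sequence. This is a minimal sketch for a single block: the table layout and the `access` helper are hypothetical, and the model tracks only states and bus actions, not data.

```python
# (current state, observed event) -> (next state, generated bus action or None)
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def access(states, proc, event):
    """Apply a processor event at cache `proc`; the other caches snoop the bus.

    `states` holds one state per cache for a single block and is updated
    in place. Returns the list of bus actions that were generated.
    """
    next_state, action = MSI[(states[proc], event)]
    log = []
    if action is not None:
        log.append(action)
        for other in range(len(states)):
            # Caches in I ignore bus traffic for this block.
            if other != proc and states[other] != "I":
                snooped_state, response = MSI[(states[other], action)]
                states[other] = snooped_state
                if response is not None:
                    log.append(response)
    states[proc] = next_state
    return log


states = ["I", "I", "I"]
log1 = access(states, 0, "PrRd")   # P0 read miss
log2 = access(states, 1, "PrWr")   # P1 write: invalidates P0's copy
log3 = access(states, 2, "PrRd")   # P2 read: P1 flushes and drops to S
snapshot = list(states)
log4 = access(states, 1, "PrWr")   # P1 writes again from S
print(log1, log2, log3, snapshot, log4, states)
```

The last step shows the case discussed in the notes: a write to a block already held in S still issues BusRdX, which is what BusUpgr would avoid.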
pag 294
– Write performed in writer’s cache before it handles other transactions, so
– sequence of such writes between two bus xactions for the block must come
– in serialization, the sequence appears between these two bus xactions
– reads by P will see them in this order w.r.t. other bus transactions
– reads by other processors separated from sequence by a bus xaction, which
– so reads by all processors see writes in same order
pag 297
– A memory operation Mj is subsequent to Mi if
– Writes from other processors by the previous bus xaction P issued
– Writes from P by program order
pag 297-8
– good for mostly read data
– what about “migratory” data
pag 298
– e.g. even in sequential program
– BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
– invalid
– exclusive or exclusive-clean (only this cache has copy, but not modified)
– shared (two or more caches may have copies)
– modified (dirty)
– needs “shared” signal on bus: wired-or line asserted in response to BusRd
pag 299
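– other caches take the normal action (S -> S or S -> I)
[State diagram: MESI protocol with states M, E, S, I; BusRd(S) means the shared line is asserted, BusRd(not S) means it is not]
– M: PrRd/- and PrWr/- stay in M; BusRd/Flush -> S; BusRdX/Flush -> I
– E: PrRd/- stays in E; PrWr/- -> M; BusRd/Flush -> S; BusRdX/Flush -> I
– S: PrRd/- stays in S; PrWr/BusRdX -> M; BusRd/Flush stays in S; BusRdX/Flush -> I
– I: PrRd/BusRd(S) -> S; PrRd/BusRd(not S) -> E; PrWr/BusRdX -> M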
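The benefit of the E state shows up when a processor reads a block and then writes it. The sketch below is a hypothetical illustration (the function `read_then_write` is not from the slides) that counts the bus transactions for that sequence under MSI and under MESI, starting from state I.

```python
def read_then_write(protocol, other_sharers):
    """Count bus transactions for a read miss followed by a write to the same block.

    protocol: "MSI" or "MESI"
    other_sharers: True if another cache asserts the shared signal on the BusRd.
    Returns (number of bus transactions, final state).
    """
    transactions = 1                      # the read miss always issues BusRd
    if protocol == "MESI" and not other_sharers:
        state = "E"                       # PrRd/BusRd(not S): exclusive-clean
    else:
        state = "S"                       # MSI, or shared line asserted

    if state == "E":
        state = "M"                       # PrWr/-: silent upgrade, no bus traffic
    else:
        transactions += 1                 # PrWr/BusRdX (or BusUpgr)
        state = "M"
    return transactions, state


msi_private = read_then_write("MSI", other_sharers=False)
mesi_private = read_then_write("MESI", other_sharers=False)
mesi_shared = read_then_write("MESI", other_sharers=True)
print(msi_private, mesi_private, mesi_shared)
```

For data that no other cache holds, MESI needs one bus transaction where MSI needs two; when the block is actually shared the two protocols behave the same.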
pag 301
pag 300
– Sm and Sc can coexist in different caches, with only one Sm
– unlike BusRd, which transfers the whole cache line
pag 302
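[State diagram: Dragon update protocol with states E, Sc, Sm, M; (S) means the shared line is asserted, (not S) means it is not]
– E: PrRd/- stays in E; PrWr/- -> M; BusRd/- -> Sc
– Sc: PrRd/- stays in Sc; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(not S) -> M; BusUpd/Update stays in Sc
– Sm: PrRd/- stays in Sm; PrWr/BusUpd(S) stays in Sm; PrWr/BusUpd(not S) -> M; BusRd/Flush stays in Sm; BusUpd/Update -> Sc
– M: PrRd/- and PrWr/- stay in M; BusRd/Flush -> Sm
– read miss: PrRdMiss/BusRd(not S) -> E; PrRdMiss/BusRd(S) -> Sc
– write miss: PrWrMiss/BusRd(not S) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm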
pag 304
– want a balanced system: no expensive resource heavily underutilized
– transcends architectural details, but not what we’re really after
– Cheap simulation: no need to model contention
pag 305
[Figure: address-bus and data-bus traffic (MB/s) for Barnes, LU, Ocean, Radiosity, Radix and Raytrace, and for Appl-Code, Appl-Data, OS-Code and OS-Data, under the III, 3St and 3St-RdEx protocol variants]
of BusUpgr
exclusive copy for modification
exclusive, but will not modify it, and therefore does not receive a copy
pag 312
– latter due to granularity of coherence being larger than a word
– increase misses due to false sharing if spatial locality not good
– increase misses due to conflicts in fixed-size cache
– increase traffic due to fetching unnecessary data and due to false sharing
– can increase miss penalty and perhaps hit cost
pag 313
[Figure: miss classification flowchart. Decision points: first reference to memory block by processor; first access systemwide; written before; reason for elimination of last copy (replacement or invalidation); old copy with state = invalid still there; has block been modified since replacement; modified word(s) accessed during lifetime. Resulting classes include cold, capacity, true-sharing, false-sharing and mixed classes such as inval-cap and cap-inval.]
(both resources)
pag 317
[Figure: miss rate (%) broken down into cold, capacity, true sharing, false sharing and upgrade misses, for Barnes, Lu, Radiosity, Ocean, Radix and Raytrace at block sizes 8, 16, 32, 64, 128 and 256]
– So total traffic often minimized at 16-32 byte block, not smaller
[Figure: data-bus and address-bus traffic (bytes/instruction, bytes/FLOP) for Radix, LU, Ocean, Barnes, Radiosity and Raytrace at block sizes 8, 16, 32, 64, 128 and 256]
pag 326
– use subblocks: same tag but different state bits
– one subblock may be valid but another invalid or dirty
– But can change consistency model: discuss later in course
pag 328
– e.g. producer-consumer pattern
– “pack rat” phenomenon particularly bad under process
– useless updates where only last one will be used
pag 329
[Figure: miss rate (%) split into cold, capacity, true sharing and false sharing for LU, Ocean, Raytrace and Radix under invalidate (inv), update (upd) and mixed (mix) protocols]
pag 332
– many bus transactions
– could delay updates or use
– bandwidth, complexity,
[Figure: upgrade/update rate (%) for LU, Ocean, Raytrace and Radix under inv, mix and upd protocols]
pag 333
– point-to-point
– group
– global (barriers)
pag 334
– but it goes against the “RISC” flow, and has other problems
– load-locked, store-conditional
– later used by PowerPC and DEC Alpha too
pag 334
– Acquire right to the synch (enter critical section, go past event)
– Wait for synch to become available when it isn’t
– Enable other processors to acquire right to the synch
pag 335
pag 335
pag 336
pag 336
– few locks can be in use at a time (one per lock line)
– hardwired waiting algorithm (usually busy-wait followed by abort
pag 337
pag 338
pag 339
lock:   t&s register, location
        bnz lock             /* if not 0, try again */
        ret                  /* return control to caller */
unlock: st location, #0      /* write 0 to location */
        ret                  /* return control to caller */
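The lock and unlock routines above can be mimicked in a high-level language. The sketch below is only an emulation under stated assumptions: Python has no test&set instruction, so a `threading.Lock` stands in for the hardware atomicity of the read-modify-write, and the class name `TASLock` is hypothetical.

```python
import threading
import time


class TASLock:
    """Spin lock built on an emulated test&set."""

    def __init__(self):
        self._location = 0                 # 0 = free, 1 = held
        self._atomic = threading.Lock()    # stands in for hardware atomicity

    def _test_and_set(self):
        with self._atomic:
            old = self._location
            self._location = 1
            return old

    def lock(self):
        while self._test_and_set() != 0:   # bnz lock: if not 0, try again
            time.sleep(0)                  # yield so the holder can run

    def unlock(self):
        self._location = 0                 # st location, #0


counter = 0
lock = TASLock()


def worker(rounds):
    global counter
    for _ in range(rounds):
        lock.lock()
        counter += 1                       # critical section
        lock.unlock()


threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

Every failed attempt in `lock` still performs a write to the lock location, which on a real bus-based machine is the source of the traffic this simple lock is known for.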
– Three operands: location, register to compare with, register to swap with
– Not commonly supported by RISC instruction sets
pag 339
On SGI Challenge. Code: lock; critical section (delay(c)); unlock;
Same total no. of lock calls as p increases; measure time per lock transfer
[Figure: time (µs) versus number of processors for four curves: test&set, c = 0; test&set with exponential backoff, c = 3.64; test&set with exponential backoff, c = 0; ideal]
/ unlock, excluding the critical section
top curve = dependence on time and contention
degrades because unsuccessful test&sets generate traffic (there is always a write to the lock variable in the cache during the waiting phase)
pag 341
– cached lock variable will be invalidated when release occurs
– only one attemptor will succeed; others will fail and start testing again
pag 342
pag 343
pag 344
pag 345
– valuable when using test&set instructions; LL-SC does it already
– valuable with LL-SC too
pag 346
– atomic op when arrive at lock, not when it’s free (so less contention)
– like simple LL-SC lock, but no inval when SC succeeds, and fair
– exponential backoff not a good idea due to FIFO order
– backoff proportional to now-serving - next-ticket may work well
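A ticket lock along these lines can be sketched as follows. The code is an emulation, not a hardware implementation: a `threading.Lock` stands in for the atomic fetch&inc, and the class name `TicketLock` and the `backoff` parameter are hypothetical. Waiters only read `now_serving`, and they back off in proportion to the number of tickets still ahead of them.

```python
import threading
import time


class TicketLock:
    def __init__(self, backoff=1e-5):
        self._next_ticket = 0
        self._now_serving = 0
        self._backoff = backoff              # seconds per waiter ahead (assumed)
        self._atomic = threading.Lock()      # emulates atomic fetch&inc

    def _fetch_and_inc(self):
        with self._atomic:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def lock(self):
        my_ticket = self._fetch_and_inc()    # one atomic op on arrival
        while True:
            ahead = my_ticket - self._now_serving
            if ahead == 0:
                return my_ticket
            time.sleep(self._backoff * max(ahead, 0))   # proportional backoff

    def unlock(self):
        self._now_serving += 1               # only the holder writes this


served = []
lock = TicketLock()


def worker(rounds):
    for _ in range(rounds):
        ticket = lock.lock()
        served.append(ticket)                # critical section
        lock.unlock()


threads = [threading.Thread(target=worker, args=(50,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(served == list(range(200)))
```

The order of tickets recorded inside the critical section is strictly increasing, which is the FIFO property the bullets refer to.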
pag 347
– fetch&inc to obtain address on which to spin (next array element) (with
– ensure that these addresses are in different cache lines or memories
– set next location in array, thus waking up process spinning on it (only
– array location I spin on not necessarily in my local memory (solution later)
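The array-based lock can be sketched in the same emulated style. The assumptions here are that at most `max_procs` threads contend at once, that a `threading.Lock` stands in for the atomic fetch&inc, and that the names `ArrayLock` and `max_procs` are hypothetical. Python lists do not expose cache lines, so the padding that keeps each slot in its own line on real hardware appears only as a comment.

```python
import threading
import time


class ArrayLock:
    def __init__(self, max_procs):
        self._size = max_procs
        # On real hardware each flag would sit in its own cache line.
        self._flags = [False] * max_procs
        self._flags[0] = True                # first arrival may enter
        self._next_slot = 0
        self._atomic = threading.Lock()      # emulates atomic fetch&inc

    def lock(self):
        with self._atomic:                   # fetch&inc picks my slot
            slot = self._next_slot % self._size
            self._next_slot += 1
        while not self._flags[slot]:         # spin on my own location only
            time.sleep(0)
        return slot

    def unlock(self, slot):
        self._flags[slot] = False                        # reset my slot for reuse
        self._flags[(slot + 1) % self._size] = True      # wake the next waiter


counter = 0
lock = ArrayLock(max_procs=4)


def worker(rounds):
    global counter
    for _ in range(rounds):
        slot = lock.lock()
        counter += 1                         # critical section
        lock.unlock(slot)


threads = [threading.Thread(target=worker, args=(50,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

Each release writes exactly one location that exactly one waiter is spinning on, so only that waiter is disturbed.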
pag 347
– Not so with delay between unlock and next lock
– Need to be careful with backoff
[Figure: time (µs) versus number of processors for array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff locks; panels: (a) Null (c = 0, d = 0); (b) Critical-section (c = 3.64 µs, d = 0); (c) Delay (c = 3.64 µs, d = 1.29 µs)]
pag 349
– producer: write if empty, set to full; consumer: read if full; set to empty
– multiple consumers, or multiple writes before consumer reads?
– needs language support to specify when to use
– composite data structures?
pag 352
– even harder with multiple processes per processor
– e.g. latter due to process migration
pag 358
pag 354
Consecutively entering the same barrier doesn’t work
– a delayed process (e.g. delayed by the OS) can get stuck at the first barrier: it is swapped out by the OS (it waited too long), and when it comes back it sees the flag at 0, which signals waiting at the barrier, but that is already the next barrier; deadlock at the first barrier
Sense reversal: wait for flag to take different value consecutive times
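A centralized barrier with sense reversal can be sketched as below. This is an illustration under assumptions: the class name `SenseBarrier` is hypothetical, a `threading.Lock` protects the counter, and each thread keeps its private sense in thread-local storage. Because the release value alternates between consecutive barriers, a slow thread leaving one barrier cannot be confused by the flag of the next one.

```python
import threading
import time


class SenseBarrier:
    def __init__(self, p):
        self._p = p
        self._count = 0
        self._flag = False                    # shared release flag
        self._lock = threading.Lock()
        self._local = threading.local()       # per-thread sense

    def wait(self):
        # Toggle the private sense: the value to wait for alternates each time.
        local_sense = not getattr(self._local, "sense", False)
        self._local.sense = local_sense
        with self._lock:
            self._count += 1
            last = self._count == self._p
            if last:
                self._count = 0               # reset for the next barrier
        if last:
            self._flag = local_sense          # release everyone
        else:
            while self._flag != local_sense:  # spin until released
                time.sleep(0)


P, PHASES = 4, 5
barrier = SenseBarrier(P)
arrivals = [0] * PHASES
arrivals_lock = threading.Lock()
checks = []


def worker():
    for phase in range(PHASES):
        with arrivals_lock:
            arrivals[phase] += 1
        barrier.wait()
        # After the barrier, every thread must have arrived in this phase.
        checks.append(arrivals[phase] == P)


threads = [threading.Thread(target=worker) for _ in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(checks), len(checks))
```

The same barrier object is reused for five consecutive phases without any reinitialization between them.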
pag 355
pag 356
[Figure: flat barrier (contention) versus tree-structured barrier (little contention)]
pag 356
– Will discuss fancier barrier algorithms for distributed machines
– Also for spinning on highly contended locks
[Figure: time (µs) versus number of processors (1 to 8) for centralized, combining tree, tournament and dissemination barriers]
pag 357
pag 358
– write sharing: traffic due to invalidations; and also likely protection by
pag 359
[Figure: bus traffic versus cache size. Components: cold-start (compulsory) traffic, true sharing (inherent communication), false sharing, and capacity-generated traffic (including conflicts), which drops at the first and second working sets]
these 3 types of miss generate traffic even with an infinite cache
pag 359
[Figure: contiguity in memory layout for a grid partitioned among P1 to P8. (a) Two-dimensional array: a cache block straddles a partition boundary. (b) Four-dimensional array: a cache block is within a partition.]
pag 360
[Figure: a grid partitioned among P1 to P8, next to the cache entries that one processor's partition maps to]
Locations in the marked subrows map to the same entries (indices) in the same cache. The rest of the processor's cache entries are not mapped to by locations in its partition (but would have been mapped to by subrows in other processors' partitions) and are thus wasted.
worst case: direct-mapped cache, and data matrix row size = cache size
pag 362
[Figure: data-bus and address-bus traffic (bytes/instruction) for Appl-Code, Appl-Data, OS-Code and OS-Data at block sizes 64, 128 and 256]
Earlier figure in the book, not shown in the slides (fig 5.25)
pag 364
pag 366
– allocate (local) list element and enqueue on list
– spin on flag field of that list element
– set flag of next element on list
– swap is sufficient, but lose FIFO property
– FIFO
– spin locally (cache-coherent or not)
– O(1) network transactions even without consistent caches
– O(1) space per lock
– but, compare&swap difficult to implement in hardware
– could be much longer than a memory access
– wait for SC to go to directory and get ownership (long latency)
– have LL load in exclusive mode, so SC succeeds immediately if still in
[Figure: broadcast tree and timeline from P0 to P1..P7 under the LogP model (Latency, Overhead, Gap) with L=6, o=2, g=4, P=8; starting from P0 at time 0, the other processors receive the message at times 10, 14, 18, 22, 20, 24 and 24]
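The receive times in this example can be reproduced with a small greedy schedule. The sketch below is one reading of the model, stated as an assumption: a send started at time t is fully received at t + o + L + o, a sender can start its next send g time units after the previous one (taking g >= o), and every processor that already has the message keeps forwarding it. The function name `logp_broadcast` is hypothetical.

```python
import heapq


def logp_broadcast(P, L, o, g):
    """Return the sorted receive-completion times of the P-1 non-root processors."""
    send_ready = [0]           # earliest start times of possible sends
    arrivals = []
    for _ in range(P - 1):
        start = heapq.heappop(send_ready)     # earliest available sender
        arrive = start + o + L + o            # send overhead, latency, receive overhead
        arrivals.append(arrive)
        heapq.heappush(send_ready, start + max(g, o))   # same sender, next send
        heapq.heappush(send_ready, arrive)              # new holder starts sending
    return sorted(arrivals)


times = logp_broadcast(P=8, L=6, o=2, g=4)
print(times)
```

With L=6, o=2, g=4 and P=8 this yields the same set of times as the figure, with the last processors informed at time 24.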
– could use combining wakeup tree
[Figure: P1, P2 and P3, each with a cache ($), share a bus with memory and I/O devices; memory initially holds u:5]
1. P1 reads Mem(u) into $1
2. P3 reads Mem(u) into $3
3. P3 writes 7 to $3(u) and Mem(u) (write-through); the $3 controller generates a bus transaction -> the $1 controller invalidates $1(u)
4. P1 reads $1 -> miss -> reads the updated value from memory
5. P2 reads $2 -> miss -> reads the updated value from memory