COSC 5351 Advanced Computer Architecture
Slides modified from Hennessy CS252 course slides
MP Motivation
SISD v. SIMD v. MIMD
Centralized vs. Distributed Memory
Challenges to Parallel Programming
Consistency, Coherency, Write
3/19/2012 2 COSC5351 Advanced Computer Architecture
[Figure: Performance (vs. VAX-11/780) on a log scale from 1 to 10000, 1978 to 2006; growth annotated as 25%/year, 52%/year, and ??%/year]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Flynn classified computers by data and control streams (1966)
SIMD: Data Level Parallelism
MIMD: Thread Level Parallelism
MIMD popular because
M.J. Flynn, "Very High-Speed Computers",
[Figure: centralized vs. distributed memory organizations: processors P1 to Pn, each with a cache ($), an interconnection network, and memory modules (Mem)]
Also called symmetric multiprocessors (SMPs)
Large caches => a single memory can satisfy memory
Can scale to a few dozen processors by using a
Although scaling beyond that is technically
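Goal: 80x speedup with 100 CPUs; what fraction of the original computation can be sequential?
Assume the program runs in only two modes: parallel, using all processors, or serial, using one processor
If the whole program ran in parallel mode, speedup would be the number of processors
$$\text{Speedup}_{\text{overall}} = \frac{1}{\left(1 - \text{Fraction}_{\text{enhanced}}\right) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
$$80 = \frac{1}{\left(1 - \text{Fraction}_{\text{parallel}}\right) + \dfrac{\text{Fraction}_{\text{parallel}}}{100}}$$
$$80 \times \left[\left(1 - \text{Fraction}_{\text{parallel}}\right) + \frac{\text{Fraction}_{\text{parallel}}}{100}\right] = 1 \;\Rightarrow\; 79 = 79.2 \times \text{Fraction}_{\text{parallel}} \;\Rightarrow\; \text{Fraction}_{\text{parallel}} \approx 0.9975$$
So at most about 0.25% of the original computation can be sequential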
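A minimal Python sketch of the arithmetic above, assuming the same two-mode model (serial on one processor, parallel on all N). Rearranging Amdahl's Law gives F = (1 - 1/target) / (1 - 1/N). The helper names `amdahl_speedup` and `required_parallel_fraction` are hypothetical, not from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: overall speedup when only part of the time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def required_parallel_fraction(target_speedup: float, processors: int) -> float:
    """Solve target = 1 / ((1 - F) + F / N) for the parallel fraction F.

    Rearranging: 1 / target = 1 - F * (1 - 1 / N), so
    F = (1 - 1 / target) / (1 - 1 / N).
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


f = required_parallel_fraction(80, 100)
print(f"parallel fraction   = {f:.4f}")                       # 0.9975
print(f"sequential fraction = {1 - f:.2%}")                   # 0.25%
print(f"check: speedup      = {amdahl_speedup(f, 100):.1f}")  # 80.0
```

Plugging the fraction back into `amdahl_speedup` is a cheap sanity check that the rearrangement is right.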
CPI = Base CPI + Remote request rate x Remote request cost
CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
With no communication (all references local), the machine is 1.3/0.5 = 2.6x faster
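The same calculation as a small Python helper. The name `cpi_with_remote` is hypothetical; the 0.5 base CPI, 0.2% remote rate, and 400-cycle remote cost are the example's values. Like the formula, it assumes each remote request stalls the processor for the full remote cost.

```python
def cpi_with_remote(base_cpi: float, remote_rate: float, remote_cost: float) -> float:
    """CPI = Base CPI + Remote request rate x Remote request cost (in cycles)."""
    return base_cpi + remote_rate * remote_cost


base_cpi = 0.5
# 0.2% of instructions make a remote request costing 400 cycles each.
cpi = cpi_with_remote(base_cpi, remote_rate=0.002, remote_cost=400)
print(f"CPI = {cpi:.1f}")                                    # 1.3
print(f"all-local version is {cpi / base_cpi:.1f}x faster")  # 2.6
```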
Processes accessing main memory may see a very stale value
[Figure: processors P1, P2, and P3 with private caches ($) sharing Memory and I/O devices; events numbered 1 to 5, with events 4 and 5 asking "u = ?"]
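[Figure: processor P with L1 and L2 caches, Memory, and Disk; address 100 holds different values at different levels (100:34, 100:35, 100:67)]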
This process should see the value written immediately
Burak is meeting Lina at a restaurant and he arrives first
The tuna is sold out, so they change the sign to Salmon
Lina shows up and sees the Salmon
Burak waits for Lina to decide; she says she'll have the
What does Burak think she is ordering?
Intuition not guaranteed by coherence: we expect memory to respect the order between accesses
Coherence is not enough!
[Figure: Conceptual Picture: processors P1 to Pn accessing a single shared memory (Mem)]
Programs on multiple processors will normally have
Rather than trying to avoid sharing in SW, SMPs use a
Migration - data can be moved to a local cache and
Replication – for shared data being simultaneously
Cache Controller “snoops” all transactions on
Either get exclusive access before write via
[Figure: processors P1 to Pn with caches ($) on a shared bus with memory (Mem) and I/O devices; each cache block holds State, Address, and Data, and every cache controller snoops the bus for cache-memory transactions]
Must invalidate before step 3
Write update uses more broadcast medium bandwidth
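To make the bandwidth claim concrete, here is a deliberately simplified Python model that counts bus transactions for accesses to a single block. The function `count_bus_transactions` and its cost rules are assumptions of this sketch, not a full protocol: a read miss costs one transaction; under write invalidate a write costs one transaction unless the writer already holds the only copy; under write update every write that misses or has other sharers costs one broadcast.

```python
from typing import List, Set, Tuple


def count_bus_transactions(trace: List[Tuple[str, str]], policy: str) -> int:
    """Count bus transactions for (processor, 'R'|'W') accesses to ONE block."""
    holders: Set[str] = set()  # caches currently holding a valid copy
    bus = 0
    for proc, op in trace:
        if op == "R":
            if proc not in holders:            # read miss: fetch the block over the bus
                bus += 1
                holders.add(proc)
        elif op == "W":
            if policy == "invalidate":
                if holders != {proc}:          # need exclusive access: one invalidate/write miss
                    bus += 1
                    holders = {proc}           # every other copy is invalidated
            elif policy == "update":
                if proc not in holders or holders - {proc}:
                    bus += 1                   # write miss or broadcast of the new value
                    holders.add(proc)
            else:
                raise ValueError(f"unknown policy {policy!r}")
        else:
            raise ValueError(f"unknown op {op!r}")
    return bus


# P1 and P2 share the block, then P1 writes it four times before P2 reads it again.
trace = [("P1", "R"), ("P2", "R")] + [("P1", "W")] * 4 + [("P2", "R")]
print(count_bus_transactions(trace, "invalidate"))  # 4
print(count_bus_transactions(trace, "update"))      # 6
```

Under this model the balance flips for a tight producer-consumer pattern where the reader touches the block after every write, since update avoids the reader's repeated misses.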
[Figure: processors P1, P2, and P3 with caches ($), Memory, and I/O devices; P3 has written u = 7; events 4 and 5 ask "u = ?"]
1. P1 reads u
2. P3 reads u
3. P3 writes u = 7
4. P1 reads u
5. P2 reads u
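The five events above, replayed in a small Python model with and without write invalidation. It assumes write-back caches that hold one word per block, unlimited capacity, and, as a hypothetical starting point, memory initially holding u = 5 (that value is not in the extracted text). The class `WriteBackSystem` and function `run_trace` are illustrative names.

```python
from typing import Dict, List, Tuple


class WriteBackSystem:
    """Processors with private write-back caches over one shared memory (hypothetical model)."""

    def __init__(self, procs: List[str], memory: Dict[str, int], invalidate: bool) -> None:
        self.memory = dict(memory)
        # Per-processor cache: address -> [value, dirty]
        self.caches: Dict[str, Dict[str, List]] = {p: {} for p in procs}
        self.invalidate = invalidate

    def read(self, p: str, addr: str) -> int:
        cache = self.caches[p]
        if addr in cache:                          # hit: return the local copy, even if stale
            return cache[addr][0]
        if self.invalidate:
            for other in self.caches.values():
                entry = other.get(addr)
                if entry is not None and entry[1]: # dirty owner supplies the data
                    self.memory[addr] = entry[0]   # and memory is brought up to date
                    entry[1] = False
        cache[addr] = [self.memory[addr], False]
        return cache[addr][0]

    def write(self, p: str, addr: str, value: int) -> None:
        if self.invalidate:
            for q, other in self.caches.items():
                if q != p:
                    other.pop(addr, None)          # invalidate every other copy first
        self.caches[p][addr] = [value, True]       # write-back: memory is not updated yet


def run_trace(invalidate: bool) -> Tuple[int, int]:
    s = WriteBackSystem(["P1", "P2", "P3"], {"u": 5}, invalidate)  # assumed initial u = 5
    s.read("P1", "u")          # 1: P1 reads u
    s.read("P3", "u")          # 2: P3 reads u
    s.write("P3", "u", 7)      # 3: P3 writes u = 7
    step4 = s.read("P1", "u")  # 4: P1 reads u
    step5 = s.read("P2", "u")  # 5: P2 reads u
    return step4, step5


print("no coherence:    ", run_trace(invalidate=False))  # (5, 5): both reads are stale
print("write-invalidate:", run_trace(invalidate=True))   # (7, 7)
```

Without coherence, P1 hits on its old copy and P2 reads stale memory; with invalidation, P3's write removes P1's copy, so both later reads miss and obtain 7.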
Cache block state transition diagram
Broadcast Medium Transactions (e.g., bus)
Broadcast medium enforces serialization of read
Also need to find up-to-date copy of cache
Every bus transaction must check the cache-
A way to reduce interference is to duplicate tags
Another way to reduce interference is to use L2
Snooping coherence protocol is usually
Logically, think of a separate controller
In implementations, a single controller allows
2 states per block in each cache
Writes invalidate all other cache
[Figure: processors P1 to Pn, each cache ($) holding State, Tag, and Data per block, sharing a bus with memory (Mem) and I/O devices]
What happens / What we do
Processor only observes state of memory system by issuing
Assume bus transactions and memory operations are atomic
transaction
All writes go to bus + atomicity
=> invalidations applied to caches in bus order
How to insert reads in this order?
whether write serialization is satisfied
enter directly in bus order
Let’s understand other ordering issues
Writes establish a partial order
This doesn't constrain the ordering of reads, though
as long as in program order
[Figure: sequences of reads (R) and writes (W) issued by P0, P1, and P2]
Invalidation protocol, write-back cache
to the read request and aborts the memory access
Each memory block is in one state:
Each cache block is in one state (track these):
Read misses cause all caches to snoop the bus
Writes to clean blocks are treated as misses
Read miss in exclusive or shared state or write miss in the
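The per-block controller described above (write-back, invalidation, writes to clean blocks treated as misses) can be sketched in Python as follows. It models one cache line per processor with the states Invalid, Shared, and Exclusive. `SnoopyLine` and the `Bus` harness are hypothetical names. The harness simply broadcasts each miss to every other cache, which stands in for the snooping bus.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class State(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"        # clean, read-only (RO)
    EXCLUSIVE = "Exclusive"  # dirty, read/write (RW), the only valid copy


class SnoopyLine:
    """One line of a write-back, write-invalidate snooping cache."""

    def __init__(self) -> None:
        self.state = State.INVALID
        self.tag: Optional[str] = None

    def cpu_request(self, op: str, addr: str) -> Tuple[Optional[str], Optional[str]]:
        """Handle a CPU read/write; return (address written back, bus request), either may be None."""
        if op not in ("read", "write"):
            raise ValueError(f"unknown CPU op {op!r}")
        hit = self.state is not State.INVALID and self.tag == addr
        if hit and (op == "read" or self.state is State.EXCLUSIVE):
            return None, None                     # read hit, or write hit on an owned block
        # Miss, or a write to a clean (Shared) block, which is treated as a miss.
        victim = self.tag if self.state is State.EXCLUSIVE and self.tag != addr else None
        self.tag = addr
        self.state = State.SHARED if op == "read" else State.EXCLUSIVE
        return victim, ("read miss" if op == "read" else "write miss")

    def snoop(self, bus_op: str, addr: str) -> Optional[str]:
        """React to another cache's miss on the bus; return an address written back, if any."""
        if self.state is State.INVALID or self.tag != addr:
            return None
        # An Exclusive owner supplies the block and the memory access is aborted.
        written_back = addr if self.state is State.EXCLUSIVE else None
        self.state = State.SHARED if bus_op == "read miss" else State.INVALID
        return written_back


class Bus:
    """Hypothetical harness: broadcasts each bus request to every other cache."""

    def __init__(self, names: List[str]) -> None:
        self.caches: Dict[str, SnoopyLine] = {n: SnoopyLine() for n in names}
        self.log: List[str] = []

    def access(self, who: str, op: str, addr: str) -> None:
        victim, request = self.caches[who].cpu_request(op, addr)
        if victim is not None:
            self.log.append(f"{who}: write back {victim}")
        if request is None:
            return
        self.log.append(f"{who}: {request} {addr}")
        for other, line in self.caches.items():
            if other != who and line.snoop(request, addr) is not None:
                self.log.append(f"{other}: write back {addr} (abort memory access)")


bus = Bus(["P1", "P2"])
bus.access("P1", "write", "A1")  # P1: Invalid -> Exclusive
bus.access("P2", "read", "A1")   # P1 snoops the read miss: write back, Exclusive -> Shared
bus.access("P2", "write", "A1")  # write to a clean block: P2 -> Exclusive, P1 -> Invalid
print({n: line.state.value for n, line in bus.caches.items()})
print(bus.log)
```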
State machine
Non-resident
[Figure: per-cache-block state diagram combining CPU-induced and bus-induced transitions among Invalid, Shared (RO, read-only), and Exclusive (RW, read/write); transition labels include Rd hit, Wr hit, RdMs (read miss), WrMs (write miss), xRdMs, xWrMs, WB block (write back block), and (abort MA) (abort memory access)]
[Figure: two charts of (Memory) Cycles per Instruction]
Provide set of states, state transition diagram,
Manage coherence protocol
(0) is done the same way on all systems
Different approaches distinguished by (a) to (c)
All of (a), (b), (c) done through broadcast on bus
Could do it in scalable network too
Conceptually simple, but broadcast doesn’t
Scalable coherence:
shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i;}
caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
[Figure: processors (P) with caches on an interconnection network; memory with a directory that holds presence bits and a dirty bit per block]
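The presence-bit directory actions in the fragments above ("turn dirty-bit OFF; turn p[i] ON; supply recalled data to i", "turn dirty-bit ON; turn p[i] ON") can be sketched for a single block in Python. The class `DirectoryEntry` is hypothetical, and so are two modeling choices: cached copies are kept in a dict so the example can run, and on a write all other presence bits are turned off (the original write fragment is elided at that point), with a dirty owner's data recalled before its copy is invalidated.

```python
from typing import Dict, List


class DirectoryEntry:
    """Directory state for ONE memory block: presence bits p[i] plus a dirty bit."""

    def __init__(self, n_procs: int, value: int = 0) -> None:
        self.p = [False] * n_procs
        self.dirty = False
        self.memory = value
        self.caches: Dict[int, int] = {}  # hypothetical: cached copies, proc -> value
        self.messages: List[str] = []

    def _owner(self) -> int:
        return self.p.index(True)         # when dirty, exactly one presence bit is ON

    def read(self, i: int) -> int:
        """Read miss from processor i."""
        if self.dirty:
            j = self._owner()
            self.memory = self.caches[j]  # recall line from dirty proc; update memory
            self.messages.append(f"recall from P{j}")
            self.dirty = False            # turn dirty-bit OFF (P{j} keeps a shared copy)
        self.p[i] = True                  # turn p[i] ON
        self.caches[i] = self.memory      # supply data to i
        return self.memory

    def write(self, i: int, value: int) -> None:
        """Write miss from processor i; afterwards i is the only, dirty holder."""
        for j, present in enumerate(self.p):
            if present and j != i:
                if self.dirty:            # assumption: recall the old owner's data first
                    self.memory = self.caches[j]
                self.messages.append(f"invalidate P{j}")
                self.p[j] = False         # assumption: other presence bits turned OFF
                del self.caches[j]
        self.dirty = True                 # turn dirty-bit ON
        self.p[i] = True                  # turn p[i] ON
        self.caches[i] = value            # memory is stale until the next recall


d = DirectoryEntry(n_procs=3, value=5)
d.read(0)
d.read(1)                     # p = [ON, ON, OFF], dirty OFF
d.write(2, 7)                 # invalidations to P0 and P1; p = [OFF, OFF, ON], dirty ON
print(d.p, d.dirty, d.messages)
print(d.read(0))              # recall from P2 supplies 7
```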
Similar to Snoopy Protocol: Three states
In addition to cache state, must track which
Keep it simple(r):
Message type | Source | Destination | Message contents
Read miss | Local cache | Home directory | P, A
Write miss | Local cache | Home directory | P, A
Invalidate | Home directory | Remote caches | A
Fetch | Home directory | Remote cache | A
Fetch/Invalidate | Home directory | Remote cache | A
Data value reply | Home directory | Local cache | Data
Data write back | Remote cache | Home directory | A, Data
States identical to snoopy case;
Transitions caused by read misses, write
Generates read miss & write miss msg to
Write misses that were broadcast on the
Note: on a write, a cache block is bigger,
State machine for CPU requests for each memory block
Invalid state if in memory
State machine for Directory requests for each memory block
Uncached state if in memory
Message sent to directory causes two actions:
Block is in Uncached state: the copy in memory is the
Block is Shared => the memory value is up-to-date:
Block is Exclusive: current value of the block is held in the
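The home directory's reaction in each of the three block states can be sketched as a pure Python function over the message types from the message table: read miss, write miss, and data write back. Given the block's state, its sharer set, and a message from processor p, it returns the new state, the new sharers, and the outgoing messages. The name `directory_handle` and the string encodings are hypothetical. Keeping the old owner as a sharer after a fetch is an assumption of this sketch, because the Exclusive case above is cut off.

```python
from typing import FrozenSet, List, Tuple

Msg = Tuple[str, str]  # (message type, destination)


def directory_handle(state: str, sharers: FrozenSet[str], msg: str, p: str
                     ) -> Tuple[str, FrozenSet[str], List[Msg]]:
    """Home-directory reaction to one incoming message for one memory block.

    state is 'Uncached', 'Shared', or 'Exclusive'; msg is 'read miss',
    'write miss', or 'data write back' from processor p.
    """
    out: List[Msg] = []
    if msg == "read miss":
        if state == "Exclusive":
            (owner,) = sharers
            out.append(("fetch", owner))        # owner sends data home and keeps a shared copy
        out.append(("data value reply", p))     # memory now holds the up-to-date value
        return "Shared", sharers | {p}, out
    if msg == "write miss":
        if state == "Shared":
            out += [("invalidate", q) for q in sorted(sharers - {p})]
        elif state == "Exclusive":
            (owner,) = sharers
            out.append(("fetch/invalidate", owner))
        out.append(("data value reply", p))
        return "Exclusive", frozenset({p}), out
    if msg == "data write back":
        if state != "Exclusive" or p not in sharers:
            raise ValueError("only the owner of an Exclusive block writes it back")
        return "Uncached", frozenset(), out     # memory is updated from the message data
    raise ValueError(f"unexpected message {msg!r} in state {state!r}")


s, sh = "Uncached", frozenset()
s, sh, out = directory_handle(s, sh, "read miss", "P1")   # Shared {P1}
s, sh, out = directory_handle(s, sh, "read miss", "P2")   # Shared {P1, P2}
s, sh, out = directory_handle(s, sh, "write miss", "P3")
print(s, sorted(sh), out)
```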
Step | P1 (State Addr Value) | P2 (State Addr Value) | Bus (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | | | WrMs P1 A1 | A1 Ex {P1} |
 | Excl. A1 10 | | DaRp P1 A1 | |
P1: Read A1 | Excl. A1 10 | | | |
P2: Read A1 | | | RdMs P2 A1 | |
 | Shar. A1 10 | | Ftch P1 A1 10 | | 10
 | | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 | | Excl. A1 20 | WrMs P2 A1 | | 10
 | Inv. | | Inval. P1 A1 | A1 Excl. {P2} | 10
P2: Write 40 to A2 | | | WrMs P2 A2 | A2 Excl. {P2} |
 | | | WrBk P2 A1 20 | A1 Unca. {} | 20
 | | Excl. A2 40 | DaRp P2 A2 | A2 Excl. {P2} |
Note: A1 and A2 map to the same cache block, so P2's write to A2 forces the write back of A1.
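The example trace can be replayed with a small Python model that prints the Bus column (RdMs, WrMs, Ftch, Inval., WrBk, DaRp). Several assumptions are specific to this sketch. Each processor has exactly one cache line, so A1 and A2 conflict. Memory starts at 0. A write by a processor that already holds a Shared copy gets no data reply, which matches the table. A victim write-back is logged after the write miss, in the table's order. The classes `Line` and `Machine` are hypothetical names.

```python
from typing import Dict, List, Optional, Set


class Line:
    """The single cache line each processor is assumed to have."""

    def __init__(self) -> None:
        self.state = "Inv."
        self.addr: Optional[str] = None
        self.value = 0


class Machine:
    """Directory-protocol replay; memory values start at 0 (an assumption)."""

    def __init__(self, procs: List[str]) -> None:
        self.cache = {p: Line() for p in procs}
        self.dir_state: Dict[str, str] = {}
        self.sharers: Dict[str, Set[str]] = {}
        self.mem: Dict[str, int] = {}
        self.bus: List[str] = []

    def _home(self, a: str) -> None:
        self.dir_state.setdefault(a, "Unca.")
        self.sharers.setdefault(a, set())

    def _evict(self, p: str, a: str) -> None:
        line = self.cache[p]
        if line.state == "Excl." and line.addr != a:    # dirty victim: write it back
            self.bus.append(f"WrBk {p} {line.addr} {line.value}")
            self.mem[line.addr] = line.value
            self.dir_state[line.addr], self.sharers[line.addr] = "Unca.", set()
        # A clean (Shared) victim is dropped silently; the directory is not told.

    def read(self, p: str, a: str) -> int:
        line = self.cache[p]
        if line.addr == a and line.state != "Inv.":
            return line.value                           # hit: no messages
        self._home(a)
        self.bus.append(f"RdMs {p} {a}")
        self._evict(p, a)
        if self.dir_state[a] == "Excl.":                # fetch from the owner
            (owner,) = self.sharers[a]
            o = self.cache[owner]
            self.bus.append(f"Ftch {owner} {a} {o.value}")
            self.mem[a], o.state = o.value, "Shar."
        value = self.mem.get(a, 0)
        self.bus.append(f"DaRp {p} {a} {value}")
        self.dir_state[a] = "Shar."
        self.sharers[a].add(p)
        line.state, line.addr, line.value = "Shar.", a, value
        return value

    def write(self, p: str, a: str, value: int) -> None:
        line = self.cache[p]
        if line.addr == a and line.state == "Excl.":
            line.value = value                          # write hit on an owned block
            return
        had_copy = line.addr == a and line.state == "Shar."
        self._home(a)
        self.bus.append(f"WrMs {p} {a}")
        self._evict(p, a)
        for q in sorted(self.sharers[a] - {p}):
            other = self.cache[q]
            if self.dir_state[a] == "Excl.":            # old owner: fetch its data, then invalidate
                self.bus.append(f"Ftch/Inv {q} {a} {other.value}")
                self.mem[a] = other.value
            else:
                self.bus.append(f"Inval. {q} {a}")
            if other.addr == a:
                other.state = "Inv."
        if not had_copy:                                # no reply on a Shared -> Exclusive upgrade
            self.bus.append(f"DaRp {p} {a} {self.mem.get(a, 0)}")
        self.dir_state[a], self.sharers[a] = "Excl.", {p}
        line.state, line.addr, line.value = "Excl.", a, value


m = Machine(["P1", "P2"])
m.write("P1", "A1", 10)
m.read("P1", "A1")
m.read("P2", "A1")
m.write("P2", "A1", 20)
m.write("P2", "A2", 40)
print(*m.bus, sep="\n")
print(m.dir_state, m.sharers, m.mem)
```

The printed bus actions follow the table row by row. The final directory shows A1 Uncached with memory 20 and A2 Exclusive at P2.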
“End” of uniprocessors speedup =>
Parallelism challenges: % parallelizable, long
Centralized vs. distributed memory
Message Passing vs. Shared Address
Snooping cache over shared medium for smaller
Sharing cached data Coherence (values
Shared medium serializes writes
Caches contain all information on state of
Snooping cache over shared medium for
Sharing cached data Coherence (values
Snooping and Directory Protocols similar; bus
Directory has extra data structure to keep