CS654 Advanced Computer Architecture Lec 12 – Vector Wrap-up and Multiprocessor Introduction Peter Kemper
Adapted from the slides of EECS 252 by Prof. David Patterson Electrical Engineering and Computer Sciences University of California, Berkeley
3/25/09 W&M CS654 2
(from F. Quintana, U. Barcelona.)
[Figure: memory interleaved across 8 banks; a word at address Addr is placed in bank Addr mod 8 (banks 0 through 7).]
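The bank-selection rule in the figure can be sketched in a few lines. This is a minimal model (bank count of 8 taken from the figure): unit-stride accesses spread across all banks, while a stride equal to the bank count hits a single bank and causes conflicts.

```python
NUM_BANKS = 8

def bank(addr):
    # Low-order interleaving: bank index is the address modulo the bank count
    return addr % NUM_BANKS

# Unit-stride stream touches every bank -> full memory bandwidth
unit = [bank(a) for a in range(0, 8)]
# Stride-8 stream hits the same bank every time -> bank conflicts
strided = [bank(a) for a in range(0, 64, 8)]
print(unit)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(strided)  # [0, 0, 0, 0, 0, 0, 0, 0]
```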
(Torrent-0 vector microprocessor, 1995)
– Much larger register set (32 vector registers of 64 elements; 64 + 64 scalar registers)
– 64-bit and 32-bit memory and IEEE arithmetic
– Based on 25 years of experience compiling with the Cray-1 ISA
– Scalar unit runs ahead of vector unit, doing addressing and control
– Hardware dynamically unrolls loops, and issues multiple loops concurrently
– Special sync operations keep pipeline full, even across barriers
⇒ Allows the processor to perform well on short nested loops
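The way a vector unit processes a long loop in chunks of at most the maximum vector length can be sketched with strip-mined DAXPY. This is an illustrative model only; the MVL value of 64 is an assumption in the style of classic Cray machines, not taken from the slides.

```python
MVL = 64  # assumed maximum vector length, Cray-style

def daxpy(a, x, y):
    # Strip-mined DAXPY: y = a*x + y, processed in vector-length chunks
    n = len(x)
    for start in range(0, n, MVL):
        vl = min(MVL, n - start)            # set vector length for this strip
        for i in range(start, start + vl):  # one vector instruction's worth of work
            y[i] = a * x[i] + y[i]
    return y

y = daxpy(2.0, [1.0] * 100, [3.0] * 100)
print(y[0], len(y))  # 5.0 100
```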
– Memory hierarchy: caches, local memory, remote memory
– Low latency, load/store access to entire machine (tens of TBs)
– Processors support 1000s of outstanding references with flexible addressing
– Very high bandwidth network
– Coherence protocol, addressing and synchronization optimized for distributed memory
From Horst D. Simon, NERSC/LBNL, May 15, 2002, “ESS Rapid Response Meeting” ES: Earth Simulator
[Figure: uniprocessor performance growth relative to the VAX-11/780, 1978–2006: roughly 25%/year before 1986, 52%/year from 1986 to 2002, and an uncertain ("??%/year") rate since. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
David Mitchell, The Transputer: The Time Is Now (1989)
Paul Otellini, President, Intel (2005)
[Table: multicore/multithreaded products, with columns Manufacturer/Year, Processors/chip, and Threads/Processor; the table data was not recovered from the slide.]
M.J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, no. 12, Dec. 1966
[Figure: two multiprocessor organizations — processors P1…Pn, each with a cache ($), connected through an interconnection network to memory modules (Mem): one with the memories on the far side of the network (shared, dance-hall style), one with a memory placed with each processor (distributed).]
[Figure: the cache coherence problem — processors P1, P2, P3 with private caches ($) on a bus with memory and I/O devices; memory holds u = 5. Numbered steps 1–3 show P1 and P3 reading u and P3 then writing u; steps 4 and 5 ask what value P1 and P2 subsequently read (u = ?).]
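The stale-read scenario in the figure can be reproduced with a toy model: two private write-back caches over one memory, with no invalidation at all, so after one processor updates u, the other still reads its stale cached copy. This is a minimal sketch of the problem, not of any real protocol.

```python
memory = {"u": 5}

class Cache:
    # Private write-back cache with no coherence support
    def __init__(self):
        self.data = {}
    def read(self, addr):
        if addr not in self.data:   # miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]
    def write(self, addr, value):
        self.data[addr] = value     # write-back: memory not updated yet

p1, p3 = Cache(), Cache()
p1.read("u")         # step 1: P1 caches u = 5
p3.read("u")         # step 2: P3 caches u = 5
p3.write("u", 7)     # step 3: P3 writes u = 7 (only in its own cache)
print(p1.read("u"))  # step 4: P1 still sees the stale value 5
```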
[Figure: conceptual picture — processors P1…Pn sharing a single memory Mem.]
[Figure: memory hierarchy P → L1 → L2 → memory → disk, illustrating how a single location (address 100) can hold different values (34, 35, 67) at different levels of the hierarchy.]
[Figure: snoopy cache organization — processors P1…Pn, each with a cache ($) whose lines carry State, Address, and Data fields, attached to a shared bus with memory and I/O devices; each cache serves its processor and snoops the bus for cache–memory transactions.]
[Figure: the coherence example revisited — memory holds u = 5; 1: P1 reads u, 2: P3 reads u, 3: P3 writes u = 7, then 4: P1 reads u = ? and 5: P2 reads u = ?]
[Figure: snoopy cache design — each cache line keeps State, Tag, and Data; the State/Tag arrays appear on both the processor side and the bus side, so the snooping logic and the processor can check them concurrently; processors P1…Pn on a shared bus with memory and I/O devices.]
– All phases of one bus transaction complete before the next one starts
– The processor waits for a memory operation to complete before issuing the next
– With a one-level cache, assume invalidations are applied during the bus transaction
– Writes are serialized by the order in which they appear on the bus (bus order) ⇒ invalidations are applied to caches in bus order
– This matters because processors see writes through reads, so bus order determines whether write serialization is satisfied
– But read hits may happen independently: they do not appear on the bus and do not enter bus order directly
– Any order among reads between writes is fine, as long as each processor's accesses stay in program order
[Figure: interleavings of reads (R) and writes (W) issued by processors P0, P1, and P2.]
– Snoops every address on the bus
– If it has a dirty copy of the requested block, it provides that block in response to the read request and aborts the memory access
A block in the memory system is:
– Clean in all caches and up-to-date in memory (Shared)
– OR dirty in exactly one cache (Exclusive)
– OR not in any cache
Cache states:
– Shared: block can be read
– OR Exclusive: this cache has the only copy, it is writeable, and dirty
– OR Invalid: block contains no data (as in a uniprocessor cache)
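The three states above form a small state machine. Below is a hedged sketch of the transitions for a write-back invalidation protocol in this style; the action names (RdMs, WrMs) follow the worked example later in the deck, but the exact transition set is illustrative, not a complete protocol specification.

```python
# States from the slides: Invalid, Shared (readable), Exclusive (writable, dirty)
INVALID, SHARED, EXCLUSIVE = "Inv", "Shar", "Excl"

def on_processor(state, op):
    # Returns (new_state, bus_action) for a processor-side read or write
    if op == "read":
        if state == INVALID:
            return SHARED, "RdMs"        # read miss goes on the bus
        return state, None               # read hit: no bus traffic
    if op == "write":
        if state == EXCLUSIVE:
            return EXCLUSIVE, None       # write hit on an already-dirty block
        return EXCLUSIVE, "WrMs"         # must gain exclusive ownership

def on_bus(state, action):
    # Snooped bus actions issued by other processors
    if action == "RdMs" and state == EXCLUSIVE:
        return SHARED                    # supply the dirty block, downgrade
    if action == "WrMs" and state != INVALID:
        return INVALID                   # another writer: invalidate our copy
    return state
```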
Example (write-back invalidation protocol; A1 and A2 map to the same cache block):

step               | P1: state addr value | P2: state addr value | bus: action proc addr value | memory: addr value
P1: Write 10 to A1 | Excl. A1 10          |                      | WrMs P1 A1                  |
P1: Read A1        | Excl. A1 10          |                      |                             |
P2: Read A1        | Shar. A1             |                      | RdMs P2 A1                  |
                   | Shar. A1 10          |                      | WrBk P1 A1 10               | A1 10
                   |                      | Shar. A1 10          | RdDa P2 A1 10               | A1 10
P2: Write 20 to A1 | Inv.                 | Excl. A1 20          | WrMs P2 A1                  | A1 10
P2: Write 40 to A2 |                      |                      | WrMs P2 A2                  | A1 10
                   |                      | Excl. A2 40          | WrBk P2 A1 20               | A1 20
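The worked example can be replayed with a toy two-cache simulator. This is a sketch under the example's assumptions (one block per address, A1 evicted to make room for A2, memory updated only on write-back); it is not a full protocol implementation.

```python
# Toy write-back invalidation protocol replaying the slides' example.
memory = {"A1": 0, "A2": 0}

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}   # addr -> [state, value]
    def read(self, addr, others):
        line = self.lines.get(addr)
        if line is None or line[0] == "Inv":
            for o in others:               # read miss: snoop the other caches
                l = o.lines.get(addr)
                if l and l[0] == "Excl":
                    memory[addr] = l[1]    # dirty copy is written back
                    l[0] = "Shar"          # owner downgrades to Shared
            self.lines[addr] = ["Shar", memory[addr]]
        return self.lines[addr][1]
    def write(self, addr, value, others):
        for o in others:                   # write miss/upgrade: invalidate others
            l = o.lines.get(addr)
            if l and l[0] != "Inv":
                if l[0] == "Excl":
                    memory[addr] = l[1]
                l[0] = "Inv"
        self.lines[addr] = ["Excl", value]
    def evict(self, addr):
        l = self.lines.pop(addr, None)
        if l and l[0] == "Excl":
            memory[addr] = l[1]            # write back the dirty block

p1, p2 = Cache("P1"), Cache("P2")
p1.write("A1", 10, [p2])   # P1: Excl A1 10 (WrMs)
p1.read("A1", [p2])        # read hit
p2.read("A1", [p1])        # P1 writes back A1=10; both Shar
p2.write("A1", 20, [p1])   # P1 invalidated; P2: Excl A1 20
p2.evict("A1")             # A2 replaces A1: write back A1=20
p2.write("A2", 40, [p1])   # P2: Excl A2 40
print(memory)              # {'A1': 20, 'A2': 0}
```

The final line matches the last row of the table: memory holds A1 = 20, P1's copy of A1 is invalid, and P2 holds A2 exclusively with value 40.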
– Bus suits a small MP vs. an interconnection network's lower latency and larger bandwidth for a larger MP
– Uniform memory access time (UMA) vs. non-uniform memory access time (NUMA)