COSC 5351 Advanced Computer Architecture
Slides modified from Hennessy CS252 course slides
Definition of a supercomputer:
Fastest machine in world at given task
A device to turn a compute-bound problem into an I/O bound problem
Any
10/3/2011 2 COSC5351 Advanced Computer Architecture
[Figure: register and functional unit block diagram. Labels: scalar registers S0-S7 (Si, Sj, Sk, with Tjk); address registers A0-A7 (Ai, Aj, Ak, with Bjk); vector registers V0-V7 (Vi, Vj, Vk); functional units FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, Addr Mul; 64-bit x 16; NIP, LIP, CIP; (A0); ((Ah) + j k m)]
[Figure: vector programming model showing Scalar Registers, Vector Registers, and the Vector Length Register, with an element-wise add (+) across vector elements]
[Figure: vector add C = A + B. Execution using a single pipelined functional unit: one element pair (A[i], B[i]) enters per cycle and results C[0], C[1], C[2], ... emerge one at a time. Execution using four pipelined functional units: four element pairs enter per cycle, with unit 0 producing C[0], C[4], C[8], unit 1 producing C[1], C[5], C[9], unit 2 producing C[2], C[6], C[10], and unit 3 producing C[3], C[7], C[11]]
Elements 0, 4, 8, …
Elements 1, 5, 9, …
Elements 2, 6, 10, …
Elements 3, 7, 11, …
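The element grouping above can be sketched in a few lines of Python. This is a minimal illustration, assuming elements are interleaved across lanes by index modulo the lane count and that each lane accepts one element per cycle; the function names `lane_assignment` and `cycles_to_issue` are made up for this sketch.

```python
def lane_assignment(vector_length, num_lanes=4):
    """Group element indices by lane, assuming element i goes to lane i % num_lanes."""
    lanes = [[] for _ in range(num_lanes)]
    for i in range(vector_length):
        lanes[i % num_lanes].append(i)
    return lanes


def cycles_to_issue(vector_length, num_lanes=4):
    """Cycles needed to feed all elements, assuming one element per lane per cycle."""
    return -(-vector_length // num_lanes)  # ceiling division


if __name__ == "__main__":
    for lane_id, elements in enumerate(lane_assignment(12)):
        print(f"lane {lane_id}: elements {elements}")
    print("cycles with 1 unit: ", cycles_to_issue(28, 1))
    print("cycles with 4 units:", cycles_to_issue(28, 4))
```

Under these assumptions, going from one functional unit to four cuts the issue time for a 28-element vector from 28 cycles to 7, ignoring pipeline fill and dead time.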
Vector memory-memory instructions hold all vector operands in main memory
The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
Cray-1 ('76) was the first vector register machine
Vector memory-memory architectures (VMMA)
VMMAs make it difficult to overlap execution of
VMMAs incur greater startup latency
Apart from CDC follow-ons (Cyber-205, ETA-10) all
Vector Instruction
Instruction issue
start down pipeline)
[Figure: pipeline timing diagram (stages R X X X W per element) for two successive vector instructions, marking Functional Unit Latency, First Vector Instruction, Dead Time, and Second Vector Instruction]
Cray C90, Two lanes: 4 cycle dead time, 64 cycles active; maximum efficiency 94% with 128 element vectors
T0 (Berkeley), Eight lanes: no dead time; 100% efficiency with 8 element vectors
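The efficiency figures quoted above follow from a simple ratio of active cycles to active plus dead cycles. The sketch below reproduces them, assuming each lane processes one element per cycle; the function name `vector_efficiency` is hypothetical.

```python
def vector_efficiency(vector_length, lanes, dead_time):
    """Fraction of cycles doing useful work for one vector instruction.

    Assumes each lane handles one element per cycle, so the unit is
    active for vector_length / lanes cycles and then idle for dead_time cycles.
    """
    active = vector_length / lanes
    return active / (active + dead_time)


if __name__ == "__main__":
    # Cray C90: two lanes, 4 cycles dead time, 128-element vectors
    print(f"C90: {vector_efficiency(128, 2, 4):.1%}")
    # T0: eight lanes, no dead time, 8-element vectors
    print(f"T0:  {vector_efficiency(8, 8, 0):.1%}")
```

With 128 elements on two lanes the unit is active for 64 cycles, so the C90 case is 64 / (64 + 4), roughly 94%. Shorter vectors on the same machine would fare worse, because the 4 dead cycles are paid per instruction.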
[Figure: masked vector execution of an operation on A and B with mask M[0..7] = 0, 1, 0, 0, 1, 1, 0, 1. In one implementation every element pair A[i], B[i] flows through the pipeline and a Write Enable driven by M[i] gates the write data port. In the other, only the elements whose mask bit is 1 (C[1], C[4], C[5], then A[7], B[7]) reach the write data port]
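The two masked-execution schemes in the figure can be contrasted in a short Python sketch. The operation is assumed to be an add, and the names `masked_add_write_enable` and `masked_add_skip_zeros` are invented here for illustration; each function also counts how many element operations it performs.

```python
def masked_add_write_enable(a, b, c, mask):
    """Compute every element, but write the result only where the mask bit is 1."""
    out = list(c)
    ops = 0
    for i in range(len(a)):
        result = a[i] + b[i]  # work is done even for masked-off elements
        ops += 1
        if mask[i]:           # write enable
            out[i] = result
    return out, ops


def masked_add_skip_zeros(a, b, c, mask):
    """Scan the mask and operate only on elements whose mask bit is 1."""
    out = list(c)
    ops = 0
    for i, bit in enumerate(mask):
        if bit:
            out[i] = a[i] + b[i]
            ops += 1
    return out, ops


if __name__ == "__main__":
    mask = [0, 1, 0, 0, 1, 1, 0, 1]  # M[0]..M[7] from the figure
    a = list(range(8))
    b = [10] * 8
    c = [0] * 8
    print(masked_add_write_enable(a, b, c, mask))
    print(masked_add_skip_zeros(a, b, c, mask))
```

Both versions produce the same destination vector; they differ only in the operation count, 8 versus 4 for this mask, which is the trade-off between simpler control and time proportional to the number of set mask bits.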
Compress packs non-masked elements from one vector
Expand performs inverse operation
[Figure: with mask M[0..7] = 0, 1, 0, 0, 1, 1, 0, 1, compress packs A[0..7] into A[1], A[4], A[5], A[7]; expand places those elements back at the masked positions, giving B[0], A[1], B[2], B[3], A[4], A[5], B[6], A[7]]
Used for density-time conditionals and also for general selection operations
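Compress and expand can be modelled with plain lists. This is a sketch of the semantics only, not of any particular machine's instructions; the function names and the string element labels are illustrative.

```python
def compress(src, mask):
    """Pack the elements of src whose mask bit is 1 into a contiguous list."""
    return [x for x, bit in zip(src, mask) if bit]


def expand(packed, mask, dest):
    """Inverse of compress: place packed elements at the positions where the
    mask bit is 1, leaving the other positions of dest unchanged."""
    it = iter(packed)
    return [next(it) if bit else d for bit, d in zip(mask, dest)]


if __name__ == "__main__":
    mask = [0, 1, 0, 0, 1, 1, 0, 1]
    a = [f"A[{i}]" for i in range(8)]
    b = [f"B[{i}]" for i in range(8)]
    packed = compress(a, mask)
    print(packed)
    print(expand(packed, mask, b))
```

After a compress, later vector instructions can run with a vector length equal to the number of set mask bits, which is why the operation is useful for density-time conditionals.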
CMOS Technology
Scalar unit
execution
Vector unit
elements/VReg)
mask unit
SMP structure
Very short vectors added to existing ISAs for microprocessors
Usually 64-bit registers split into 2x32b or 4x16b or
Newer designs have 128-bit registers (Altivec, SSE2)
Limited instruction set:
Limited vector register length:
Trend towards fuller vector support in
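To make the idea of splitting a 64-bit register into subwords concrete, here is a small software emulation of a packed add. It is a sketch of the general technique, not of any specific instruction set; the helper names `pack`, `unpack`, and `packed_add` are assumptions, and overflow in each subword is assumed to wrap around.

```python
MASK64 = (1 << 64) - 1


def pack(values, width):
    """Pack subword values (lowest index in the lowest bits) into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << width) - 1)) << (i * width)
    return word


def unpack(word, width):
    """Split a 64-bit word into 64 // width subwords."""
    return [(word >> (i * width)) & ((1 << width) - 1) for i in range(64 // width)]


def packed_add(x, y, width):
    """Add subwords independently, with wraparound, so that no carry
    crosses a subword boundary."""
    high = sum(1 << (i * width + width - 1) for i in range(64 // width))
    low = MASK64 & ~high
    # Add the low bits of each subword, then restore the top bit with XOR.
    return ((x & low) + (y & low)) ^ ((x ^ y) & high)


if __name__ == "__main__":
    a = pack([1, 0xFFFF, 0x8000, 300], 16)
    b = pack([2, 1, 0x8000, 700], 16)
    print(unpack(packed_add(a, b, 16), 16))  # four independent 16-bit adds
```

A single 64-bit operation here performs four 16-bit adds (or two 32-bit adds with `width=32`), which is the whole appeal of these extensions, and also shows why the fixed register width limits the effective vector length.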
Each result independent of previous result
Vector instructions access memory with known
Reduces branches and branch problems in pipelines
Single vector instruction implies lots of work (roughly a whole loop)
Time = f(vector length, data dependencies, structural hazards)
Initiation rate: rate that FU consumes vector elements
Convoy: set of vector instructions that can begin execution in the same clock cycle
Chime: approx. time for a vector operation
m convoys take m chimes; if each vector length is n, they take approx. m x n clock cycles
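The convoy and chime model can be turned into a rough estimator. The sketch below assumes the usual textbook approximation that m convoys on vectors of length n take about m x n clock cycles, groups instructions into convoys using structural hazards only (data dependences are assumed to be covered by chaining), and uses illustrative instruction mnemonics and unit names that are not taken from the slides.

```python
def form_convoys(instructions):
    """Greedily group instructions, given as (name, functional_unit) pairs in
    program order, so that no convoy uses the same functional unit twice."""
    convoys = []
    busy = set()
    for name, unit in instructions:
        if not convoys or unit in busy:
            convoys.append([])  # structural hazard: start a new convoy
            busy = set()
        convoys[-1].append(name)
        busy.add(unit)
    return convoys


def estimated_cycles(instructions, vector_length):
    """Approximate execution time: one chime (about vector_length cycles) per convoy."""
    return len(form_convoys(instructions)) * vector_length


if __name__ == "__main__":
    # Hypothetical vector loop body: load, multiply, load, add, store.
    program = [
        ("LV", "load-store"),
        ("MULVS", "multiply"),
        ("LV", "load-store"),
        ("ADDV", "add"),
        ("SV", "load-store"),
    ]
    print(form_convoys(program))
    print(estimated_cycles(program, 64))
```

The estimate ignores startup latency and dead time, so it is best read as a lower bound that becomes accurate for long vectors.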
Load/store operations move groups of data
Three types of addressing
Great for unit stride:
What about non-unit stride?
[Figure: memory interleaved across eight banks, selected by Addr Mod 8 = 0, 1, 2, 3, 4, 5, 6, 7]
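The effect of stride on an eight-bank memory can be checked with a few lines of Python. This is a simplified model, assuming word addressing, a base address of 0, and bank selection by address modulo the number of banks; the function names are made up for the sketch.

```python
from math import gcd


def banks_touched(stride, num_banks=8, num_accesses=8):
    """Bank index hit by each of the first num_accesses elements of a strided access."""
    return [(i * stride) % num_banks for i in range(num_accesses)]


def distinct_banks(stride, num_banks=8):
    """Number of different banks a long strided access stream cycles through."""
    return num_banks // gcd(stride, num_banks)


if __name__ == "__main__":
    for stride in (1, 2, 3, 8):
        print(f"stride {stride}: banks {banks_touched(stride)}, "
              f"{distinct_banks(stride)} distinct")
```

Unit stride and odd strides spread accesses over all eight banks, while a stride of 2 uses only four banks and a stride of 8 hits the same bank every time, so successive accesses must wait for that bank to recover.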
One inst fetch, decode,
Structured register accesses
Smaller code for high
Bypass cache
One TLB lookup per
Move only necessary data
Control logic grows
Vector unit switches
Vector instructions
Software control of
Multimedia Processing (compress., graphics, audio synth, image
Standard benchmark kernels (Matrix Multiply, FFT,
Lossy Compression (JPEG, MPEG video and audio)
Lossless Compression (Zero removal, RLE, Differencing, LZW)
Cryptography (RSA, DES/IDEA, SHA/MD5)
Speech and handwriting recognition
Operating systems/Networking (memcpy, memset, parity,
Databases (hash/join, data mining, image/video serving)
Language run-time support (stdlib, garbage collection)
even SPECint95
Allows the processor to perform well on short nested loops
From Horst D. Simon, NERSC/LBNL, May 15, 2002, “ESS Rapid Response Meeting”
Vector is an alternative model for exploiting ILP
If code is vectorizable, then simpler hardware,
Design issues include number of lanes, number
Fundamental design issue is memory
#68 Earth Simulator - Japan Agency for Marine -
#1 K Computer (2011)
#3 Jaguar – Oak Ridge NL (2009)
Number of vector computers in Top 500?
Comparison