COSC 5351 Advanced Computer Architecture
Slides modified from Hennessy CS252 course slides
10/3/2011

Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer

Typical application areas
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
All involve huge computations on large data sets
In 70s-80s, Supercomputer ≡ Vector Machine

Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

[Figure: Cray-1 block diagram]
• Vector registers V0-V7, 64 elements each, with Vector Mask and Vector Length registers
• Scalar registers S0-S7, backed by 64 T registers
• Address registers A0-A7, backed by 64 B registers
• Functional units: FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, Addr Mul
• Single-port memory: 16 banks of 64-bit words + 8-bit SECDED
• 80 MW/sec data load/store, 320 MW/sec instruction buffer refill
• 4 instruction buffers (64-bit x 16), with NIP, CIP and LIP registers
• Memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz)
Vector Programming Model
[Figure: scalar registers r0-r15; vector registers v0-v15, each holding elements [0], [1], [2], ..., [VLRMAX-1]; Vector Length Register VLR]
• Vector arithmetic instructions, e.g. ADDV v3, v1, v2: adds elements [0] through [VLR-1] of v1 and v2 elementwise into v3
• Vector load and store instructions, e.g. LV v1, r1, r2: transfers between vector register v1 and memory, with base address in r1 and stride in r2

# C code
    for (i=0; i<64; i++)
      C[i] = A[i] + B[i];

# Scalar Code
          LI      R4, 64
    loop: L.D     F0, 0(R1)
          L.D     F2, 0(R2)
          ADD.D   F4, F2, F0
          S.D     F4, 0(R3)
          DADDIU  R1, R1, 8
          DADDIU  R2, R2, 8
          DADDIU  R3, R3, 8
          DADDIU  R4, R4, -1
          BNEZ    R4, loop

# Vector Code
          LI      VLR, 64
          LV      V1, R1
          LV      V2, R2
          ADDV.D  V3, V1, V2
          SV      V3, R3

Advantages of vector instructions
• Compact
  ◦ one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  ◦ are independent
  ◦ use the same functional unit
  ◦ access disjoint registers
  ◦ access registers in the same pattern as previous instructions
  ◦ access a contiguous block of memory (unit-stride load/store)
  ◦ access memory in a known pattern (strided load/store)
• Scalable
  ◦ can run same object code on more parallel pipelines or lanes

Executing element operations
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six stage multiply pipeline computing V3 <- V1 * V2]

ADDV C,A,B
[Figure: execution using one pipelined functional unit (one element pair A[i], B[i] enters per cycle, results C[0], C[1], C[2], ... leave one per cycle) versus execution using four pipelined functional units (four element pairs enter per cycle, results C[0]-C[3], C[4]-C[7], C[8]-C[11], ... leave four per cycle)]

Interleaved vector memory
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: cycles between accesses to same bank
[Figure: an address generator combines Base and Stride and connects the vector registers to memory banks 0-F]
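The bank busy time above determines which strides run at full speed. The following is a minimal sketch of that effect, not a model of the real Cray-1 memory pipeline: it assumes one address is issued per cycle, that the vector starts at bank 0, and that an access stalls until its bank is free. The function name `issue_cycles` and the constants are hypothetical, chosen for illustration; the 12 cycle latency is left out because it only shifts the finish time by a constant.

```c
#include <assert.h>

/* Assumed parameters, taken from the slide's Cray-1 numbers. */
enum { NUM_BANKS = 16, BANK_BUSY = 4 };

/*
 * Cycles needed to issue n element accesses with the given word stride.
 * Assumptions (not from the slides): one address issues per cycle, the
 * base address maps to bank 0, and a conflicting access simply waits
 * until its bank has finished its busy time.
 */
long issue_cycles(long n, long stride)
{
    long free_at[NUM_BANKS] = {0};  /* first cycle each bank is free again */
    long cycle = 0;                 /* earliest cycle the next access can issue */

    assert(n >= 0 && stride >= 0);
    for (long i = 0; i < n; i++) {
        int bank = (int)((i * stride) % NUM_BANKS);
        if (free_at[bank] > cycle)
            cycle = free_at[bank];          /* stall on a busy bank */
        free_at[bank] = cycle + BANK_BUSY;  /* bank is occupied for BANK_BUSY cycles */
        cycle++;                            /* next access issues one cycle later */
    }
    return cycle;
}
```

Under these assumptions a 64-element access with stride 1 or stride 4 issues in 64 cycles, because no bank is revisited within 4 cycles, while stride 16 hits the same bank every time and takes 253 cycles.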
Vector unit structure
[Figure: functional units and vector registers are partitioned into four lanes, each connected to the memory subsystem. Vector register elements are striped over lanes: lane 0 holds elements 0, 4, 8, ...; lane 1 holds 1, 5, 9, ...; lane 2 holds 2, 6, 10, ...; lane 3 holds 3, 7, 11, ...]
[Figure: a 32-element vector laid out eight elements per row, [0]-[7], [8]-[15], [16]-[23], [24]-[31], one column per lane]

Vector memory-memory versus vector register machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
• Cray-1 ('76) was first vector register machine

Example Source Code
    for (i=0; i<N; i++)
    {
      C[i] = A[i] + B[i];
      D[i] = A[i] - B[i];
    }

Vector Memory-Memory Code
    ADDV C, A, B
    SUBV D, A, B

Vector Register Code
    LV   V1, A
    LV   V2, B
    ADDV V3, V1, V2
    SV   V3, C
    SUBV V4, V1, V2
    SV   V4, D

Do VMMAs have any advantages?
• Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
  ◦ All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector operations, why?
  ◦ Must check dependencies on memory addresses
• VMMAs incur greater startup latency
  ◦ Scalar code was faster on CDC Star-100 for vectors < 100 elements
  ◦ For Cray-1, vector/scalar breakeven point was around 2 elements
• Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
(we ignore vector memory-memory from now on)

Vectorization
    for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
[Figure: over time, the scalar sequential code runs load, load, add, store for iteration 1 and then again for iteration 2; the vectorized code runs the loads of all iterations as one vector instruction, then all the adds, then all the stores]
• Vectorization is a massive compile-time reordering of operation sequencing
  ◦ requires extensive loop dependence analysis

Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, "Stripmining"

    for (i=0; i<N; i++)
      C[i] = A[i]+B[i];

          ANDI   R1, N, 63     # N mod 64
          MTC1   VLR, R1       # Do remainder
    loop: LV     V1, RA
          DSLL   R2, R1, 3     # Multiply by 8
          DADDU  RA, RA, R2    # Bump pointer
          LV     V2, RB
          DADDU  RB, RB, R2
          ADDV.D V3, V1, V2
          SV     V3, RC
          DADDU  RC, RC, R2
          DSUBU  N, N, R1      # Subtract elements
          LI     R1, 64
          MTC1   VLR, R1       # Reset full length
          BGTZ   N, loop       # Any more to do?

[Figure: arrays A, B and C are processed as a remainder piece followed by 64-element pieces]
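The stripmined loop can also be written out in C to make the control flow explicit. This is a sketch only: `MVL`, `vector_add_strip` and `stripmined_add` are hypothetical names, and the inner helper is an ordinary scalar loop standing in for one LV / LV / ADDV.D / SV sequence. As in the assembly, the remainder piece (N mod 64) goes first and every later piece uses the full vector length.

```c
#include <assert.h>
#include <stddef.h>

enum { MVL = 64 };  /* assumed maximum vector length (64-element registers) */

/* Stand-in for one LV, LV, ADDV.D, SV sequence on vl <= MVL elements. */
static void vector_add_strip(double *c, const double *a, const double *b,
                             size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/*
 * Stripmined C[i] = A[i] + B[i] for i in [0, n).
 * Returns the number of non-empty strips executed.
 */
size_t stripmined_add(double *c, const double *a, const double *b, size_t n)
{
    size_t done = 0;        /* elements processed so far */
    size_t strips = 0;
    size_t vl = n % MVL;    /* remainder piece first, like ANDI R1, N, 63 */

    while (done < n) {
        if (vl > 0) {       /* skip the empty remainder when n is a multiple of MVL */
            vector_add_strip(c + done, a + done, b + done, vl);
            done += vl;     /* bump pointers by the strip length */
            strips++;
        }
        vl = MVL;           /* all later strips use the full length */
    }
    return strips;
}
```

For n = 200 this runs one strip of 8 elements followed by three strips of 64. Handling the remainder first keeps the loop body identical for every later strip, which is why the assembly resets VLR to 64 inside the loop.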
Overlapping vector instructions
Can overlap execution of multiple vector instructions
◦ example machine has 32 elements per vector register and 8 lanes
[Figure: over time, the Load Unit, Multiply Unit and Add Unit each work on successive load, mul and add instructions while new instructions issue]
Complete 24 operations/cycle while issuing 1 short instruction/cycle

Chaining
• Vector version of register bypassing
  ◦ introduced with Cray-1

    LV   v1
    MULV v3, v1, v2
    ADDV v5, v3, v4

[Figure: the Load Unit fills V1 from memory; V1 is chained into the multiplier, which writes V3; V3 is chained into the adder, which writes V5]

• Without chaining, must wait for last element of result to be written before starting dependent instruction
[Figure: Load, Mul and Add run one after another over time]
• With chaining, can start dependent instruction as soon as first result appears
[Figure: Load, Mul and Add overlap in time]

Two components of vector startup penalty
◦ functional unit latency (time through pipeline)
◦ dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: pipeline diagram with stages R X X X W per element. The first vector instruction's elements enter on successive cycles; functional unit latency is the time for one element to pass through the pipeline; dead time separates the last element of the first vector instruction from the first element of the second]

Want to vectorize loops with indirect accesses:
    for (i=0; i<N; i++)
      A[i] = B[i] + C[D[i]]
Indexed load instruction (Gather)
    LV     vD, rD        # Load indices in D vector
    LVI    vC, rC, vD    # Load indirect from rC base
    LV     vB, rB        # Load B vector
    ADDV.D vA, vB, vC    # Do add
    SV     vA, rA        # Store result

Dead time and short vectors
• T0 (Berkeley), eight lanes
  ◦ No dead time
  ◦ 100% efficiency with 8 element vectors
• Cray C90, two lanes
  ◦ 4 cycle dead time
  ◦ 64 cycles active
  ◦ Maximum efficiency 94% with 128 element vectors
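The efficiency figures for T0 and the Cray C90 follow from simple arithmetic: a functional unit is active for (vector length / lanes) cycles and then idle for the dead time. The sketch below works that out, assuming back-to-back vector instructions and that dead time is the only source of idle cycles; the function name `vector_efficiency` is hypothetical.

```c
#include <assert.h>

/*
 * Fraction of cycles a functional unit does useful work when every
 * vector instruction is followed by dead_cycles of dead time.
 * Active cycles per instruction = ceil(vector_length / lanes).
 * Assumes instructions are issued back to back with no other stalls.
 */
double vector_efficiency(int vector_length, int lanes, int dead_cycles)
{
    assert(vector_length > 0 && lanes > 0 && dead_cycles >= 0);
    int active = (vector_length + lanes - 1) / lanes;  /* round up */
    return (double)active / (double)(active + dead_cycles);
}
```

With 128 elements, two lanes and 4 cycles of dead time this gives 64 / 68, about 94%, matching the Cray C90 figure; with 8 elements, eight lanes and no dead time it gives 100%, matching T0.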