now handout page 1

NOW Handout Page 1 1 Vector Programming Model Scalar Registers - PDF document



  1. COSC 5351 Advanced Computer Architecture, 10/3/2011. Slides modified from Hennessy CS252 course slides.

     Definition of a supercomputer:
     • Fastest machine in world at given task
     • A device to turn a compute-bound problem into an I/O bound problem
     • Any machine costing $30M+
     • Any machine designed by Seymour Cray
     CDC6600 (Cray, 1964) regarded as first supercomputer

     Typical application areas
     • Military research (nuclear weapons, cryptography)
     • Scientific research
     • Weather forecasting
     • Oil exploration
     • Industrial design (car crash simulation)
     All involve huge computations on large data sets
     In 70s-80s, Supercomputer = Vector Machine

     Epitomized by Cray-1, 1976:
     Scalar Unit + Vector Extensions
     • Load/Store Architecture
     • Vector Registers
     • Vector Instructions
     • Hardwired Control
     • Highly Pipelined Functional Units
     • Interleaved Memory System
     • No Data Caches
     • No Virtual Memory

     [Figure: Cray-1 block diagram]
     • Registers: vector registers V0-V7 (64 elements each) with V. Mask and V. Length; scalar registers S0-S7 with 64 T Regs; address registers A0-A7 with 64 B Regs
     • Functional units: FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, Addr Mul
     • Single Port Memory: 16 banks of 64-bit words + 8-bit SECDED; 80MW/sec data load/store; 320MW/sec instruction buffer refill
     • Instruction fetch: 4 Instruction Buffers (64-bit x 16); NIP, CIP, LIP
     • memory bank cycle 50 ns, processor cycle 12.5 ns (80MHz)

  2. Vector Programming Model
     • Scalar Registers: r0 to r15
     • Vector Registers: v0 to v15, each with elements [0], [1], [2], ..., [VLRMAX-1]
     • Vector Length Register: VLR
     • Vector Arithmetic Instructions, e.g. ADDV v3, v1, v2: operate on elements [0], [1], ..., [VLR-1]
     • Vector Load and Store Instructions, e.g. LV v1, r1, r2: Vector Register v1, Memory Base r1, Stride r2

     # C code
         for (i=0; i<64; i++)
           C[i] = A[i] + B[i];

     # Scalar Code
               LI R4, 64
         loop: L.D F0, 0(R1)
               L.D F2, 0(R2)
               ADD.D F4, F2, F0
               S.D F4, 0(R3)
               DADDIU R1, 8
               DADDIU R2, 8
               DADDIU R3, 8
               DSUBIU R4, 1
               BNEZ R4, loop

     # Vector Code
               LI VLR, 64
               LV V1, R1
               LV V2, R2
               ADDV.D V3, V1, V2
               SV V3, R3

     • Compact
       ◦ one short instruction encodes N operations
     • Expressive, tells hardware that these N operations:
       ◦ are independent
       ◦ use the same functional unit
       ◦ access disjoint registers
       ◦ access registers in the same pattern as previous instructions
       ◦ access a contiguous block of memory (unit-stride load/store)
       ◦ access memory in a known pattern (strided load/store)
     • Scalable
       ◦ can run same object code on more parallel pipelines or lanes

     • Use deep pipeline (=> fast clock) to execute element operations
     • Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
     [Figure: Six stage multiply pipeline computing V3 <- v1 * v2, reading V1 and V2 and writing V3]

     [Figure: ADDV C,A,B. Execution using one pipelined functional unit (A[3]..A[6] and B[3]..B[6] in flight, C[0]..C[2] completed) versus execution using four pipelined functional units (A[12]..A[27] and B[12]..B[27] in flight, C[0]..C[11] completed)]

     Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
     • Bank busy time: Cycles between accesses to same bank
     [Figure: Base and Stride feed an Address Generator between the Vector Registers and Memory Banks 0 to F]
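The bank busy time above determines which strides a banked memory can stream without stalling. The following C function is a minimal sketch of that effect, not a model of the real Cray-1 memory pipeline. It assumes word addresses, low-order interleaving (bank = address mod number of banks), and at most one request issued per cycle; the name `issue_cycles` and the `MAX_BANKS` limit are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 64  /* hypothetical upper bound for this sketch */

/*
 * Number of cycles needed to issue n element requests starting at word
 * address `base` with the given stride. A bank that accepts a request
 * cannot accept another one for `busy_time` cycles.
 */
long issue_cycles(long base, long stride, int n, int num_banks, int busy_time)
{
    long free_at[MAX_BANKS] = {0};  /* first cycle each bank is free again */
    long cycle = 0;                 /* earliest cycle for the next request */

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    assert(busy_time > 0 && n >= 0 && base >= 0 && stride >= 0);

    for (int i = 0; i < n; i++) {
        int bank = (int)((base + (long)i * stride) % num_banks);
        if (free_at[bank] > cycle)
            cycle = free_at[bank];      /* stall until the bank is free */
        free_at[bank] = cycle + busy_time;
        cycle += 1;                     /* one request per cycle at most */
    }
    return cycle;
}
```

Under these assumptions, with 16 banks and a 4 cycle busy time, `issue_cycles(0, 1, 64, 16, 4)` gives 64 (no stalls for unit stride), while `issue_cycles(0, 16, 64, 16, 4)` gives 253 because every access lands in the same bank.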

  3. [Figure: vector unit divided into four lanes. Each Lane contains part of each Functional Unit and of the Vector Registers and connects to the Memory Subsystem. The lanes hold Elements 0, 4, 8, ...; Elements 1, 5, 9, ...; Elements 2, 6, 10, ...; Elements 3, 7, 11, ...]

     [Figure: Vector register elements striped over lanes, shown in rows of eight: [0]-[7], [8]-[15], [16]-[23], [24]-[31]]

     • Vector memory-memory instructions hold all vector operands in main memory
     • The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
     • Cray-1 ('76) was first vector register machine

     Example Source Code
         for (i=0; i<N; i++)
         {
           C[i] = A[i] + B[i];
           D[i] = A[i] - B[i];
         }

     Vector Memory-Memory Code
         ADDV C, A, B
         SUBV D, A, B

     Vector Register Code
         LV V1, A
         LV V2, B
         ADDV V3, V1, V2
         SV V3, C
         SUBV V4, V1, V2
         SV V4, D

     • Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
       ◦ All operands must be read in and out of memory
     • VMMAs make it difficult to overlap execution of multiple vector operations, why?
       ◦ Must check dependencies on memory addresses
     • VMMAs incur greater startup latency
       ◦ Scalar code was faster on CDC Star-100 for vectors < 100 elements
       ◦ For Cray-1, vector/scalar breakeven point was around 2 elements
     • Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
     Do VMMAs have any advantages?
     (we ignore vector memory-memory from now on)

     Problem: Vector registers have finite length
     Solution: Break loops into pieces that fit into vector registers, "Stripmining"
         for (i=0; i<N; i++)
           C[i] = A[i]+B[i];

               ANDI R1, N, 63     # N mod 64
               MTC1 VLR, R1       # Do remainder
         loop: LV V1, RA
               DSLL R2, R1, 3     # Multiply by 8
               DADDU RA, RA, R2   # Bump pointer
               LV V2, RB
               DADDU RB, RB, R2
               ADDV.D V3, V1, V2
               SV V3, RC
               DADDU RC, RC, R2
               DSUBU N, N, R1     # Subtract elements
               LI R1, 64
               MTC1 VLR, R1       # Reset full length
               BGTZ N, loop       # Any more to do?
     [Figure: arrays A, B and C split into a Remainder piece followed by pieces of 64 elements, with A + B stored to C piece by piece]

         for (i=0; i < N; i++)
           C[i] = A[i] + B[i];
     [Figure: Scalar Sequential Code runs load, load, add, store for Iter. 1, then again for Iter. 2, along the Time axis. Vectorized Code groups the same operations from Iter. 1 and Iter. 2 into one Vector Instruction each: load, load, add, store]
     Vectorization is a massive compile-time reordering of operation sequencing
     => requires extensive loop dependence analysis
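The stripmined assembly loop above can be restated in C to make the control flow easier to follow. This is a sketch under assumptions: the helper `vector_add` stands in for the LV/LV/ADDV.D/SV sequence executed with a given vector length, and the names `stripmined_add` and `MVL` are hypothetical. As in the slide's code, the remainder (N mod 64) is handled first and every later strip uses the full length of 64.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* maximum vector length, 64 elements as in the slide */

/* Stand-in for LV, LV, ADDV.D, SV executed with VLR = vl. */
static void vector_add(const double *a, const double *b, double *c, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* Computes c[i] = a[i] + b[i] for n elements; returns the number of
 * non-empty strips that were executed. */
size_t stripmined_add(const double *a, const double *b, double *c, size_t n)
{
    size_t vl = n % MVL;   /* ANDI R1, N, 63: remainder strip goes first */
    size_t strips = 0;

    while (n > 0) {
        vector_add(a, b, c, vl);   /* a no-op when the remainder is zero */
        a += vl;                   /* bump pointers by the strip length */
        b += vl;
        c += vl;
        n -= vl;                   /* subtract elements just processed */
        if (vl > 0)
            strips++;
        vl = MVL;                  /* all later strips use the full length */
    }
    return strips;
}
```

For example, with n = 130 the sketch runs one strip of 2 elements followed by two strips of 64. One difference from the assembly: the C loop is skipped entirely for n = 0, whereas the assembly loop body executes once with a vector length of zero.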

  4. • Vector version of register bypassing
       ◦ introduced with Cray-1
         LV v1
         MULV v3,v1,v2
         ADDV v5, v3, v4
     [Figure: Memory feeds the Load Unit, which writes V1; V1 and V2 feed Mult., which writes V3; V3 and V4 feed Add, which writes V5. The Load Unit to Mult. and Mult. to Add paths are each marked Chain]

     Can overlap execution of multiple vector instructions
       ◦ example machine has 32 elements per vector register and 8 lanes
     [Figure: Load Unit, Multiply Unit and Add Unit columns over time, with load, mul and add instructions overlapping as instructions issue]
     Complete 24 operations/cycle while issuing 1 short instruction/cycle

     Two components of vector startup penalty
       ◦ functional unit latency (time through pipeline)
       ◦ dead time or recovery time (time before another vector instruction can start down pipeline)
     [Figure: pipeline diagram with stages R X X X W. The elements of the First Vector Instruction enter one after another; Functional Unit Latency is the time through the pipeline, and Dead Time separates the First Vector Instruction from the Second Vector Instruction]

     • Without chaining, must wait for last element of result to be written before starting dependent instruction
       [Figure: Load, Mul and Add run one after another along the Time axis]
     • With chaining, can start dependent instruction as soon as first result appears
       [Figure: Load, Mul and Add overlap]

     Want to vectorize loops with indirect accesses:
         for (i=0; i<N; i++)
           A[i] = B[i] + C[D[i]]
     Indexed load instruction (Gather)
         LV vD, rD          # Load indices in D vector
         LVI vC, rC, vD     # Load indirect from rC base
         LV vB, rB          # Load B vector
         ADDV.D vA, vB, vC  # Do add
         SV vA, rA          # Store result

     Cray C90, Two lanes
     • 4 cycle dead time
     • 64 cycles active
     • Maximum efficiency 94% with 128 element vectors
     T0 (Berkeley), Eight lanes
     • No dead time
     • 100% efficiency with 8 element vectors
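The efficiency figures quoted for the Cray C90 and T0 follow from a simple ratio: cycles spent active divided by active cycles plus dead time. The C function below is a small sketch of that arithmetic. It assumes the active time is the vector length divided by the number of lanes, rounded up, and that back-to-back vector instructions each pay the full dead time; the name `vector_efficiency` is hypothetical.

```c
#include <assert.h>

/*
 * Fraction of cycles in which a functional unit does useful work when
 * vector instructions of `elements` elements run back to back on
 * `lanes` lanes with `dead_cycles` of dead time between them.
 */
double vector_efficiency(int elements, int lanes, int dead_cycles)
{
    assert(elements > 0 && lanes > 0 && dead_cycles >= 0);

    /* Active cycles per instruction: elements / lanes, rounded up. */
    int active = (elements + lanes - 1) / lanes;

    return (double)active / (double)(active + dead_cycles);
}
```

With the slide's numbers, `vector_efficiency(128, 2, 4)` evaluates 64 / (64 + 4), which is about 0.94, and `vector_efficiency(8, 8, 0)` evaluates 1 / (1 + 0), which is 1.0.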
