
Vector Processors (CS252 Graduate Computer Architecture, Lecture 20) - PDF document



  1. CS252 Graduate Computer Architecture
     Lecture 20: Vector Processing => Multimedia
     David E. Culler
     Many slides due to Christoforos E. Kozyrakis

     Vector Processors
     • Initially developed for super-computing applications, today important for multimedia.
     • Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
     SCALAR (1 operation): add r3, r1, r2 (r3 = r1 + r2)
     VECTOR (N operations): vadd.vv v3, v1, v2 (v3 = v1 + v2, element by element over the vector length)
     CS252/Culler Lec 20.2 4/9/02

     Properties of Vector Processors
     • Single vector instruction implies lots of work (loop)
       – Fewer instruction fetches
     • Each result independent of previous result
       – Multiple operations can be executed in parallel
       – Simpler design, high clock rate
       – Compiler (programmer) ensures no dependencies
     • Reduces branches and branch problems in pipelines
     • Vector instructions access memory with known pattern
       – Effective prefetching
       – Amortize memory latency over a large number of elements
       – Can exploit a high bandwidth memory system
       – No (data) caches required!

     Styles of Vector Architectures
     • Memory-memory vector processors
       – All vector operations are memory to memory
     • Vector-register processors
       – All vector operations between vector registers (except vector load and store)
       – Vector equivalent of load-store architectures
       – Includes all vector machines since late 1980s
       – We assume vector-register for rest of the lecture

     Historical Perspective
     • Mid-60s: fear that performance stagnates
     • SIMD processor arrays actively developed during late 60's – mid 70's
       – bit-parallel machines for image processing
         • PEPE, STARAN, MPP
       – word-parallel for scientific
         • Illiac IV
     • Cray develops fast scalar
       – CDC 6600, 7600
     • CDC bets on vectors with Star-100
     • Amdahl argues against vector

     Cray-1 Breakthrough
     • Fast, simple scalar processor
       – 80 MHz!
       – single-phase, latches
     • Exquisite electrical and mechanical design
     • Semiconductor memory
     • Vector register concept
       – vast simplification of instruction set
       – reduced necessary memory bandwidth
     • Tight integration of vector and scalar
     • Piggy-back off 7600 stacklib
     • Later vectorizing compilers developed
     • Owned high-performance computing for a decade
       – what happened then?
       – VLIW competition

  2. Components of a Vector Processor
     • Scalar CPU: registers, datapaths, instruction fetch logic
     • Vector register
       – Fixed length memory bank holding a single vector
       – Typically 8-32 vector registers, each holding 1 to 8 Kbits
       – Has at least 2 read and 1 write ports
       – MM: Can be viewed as array of 64b, 32b, 16b, or 8b elements
     • Vector functional units (FUs)
       – Fully pipelined, start new operation every clock
       – Typically 2 to 8 FUs: integer and FP
       – Multiple datapaths (pipelines) used for each unit to process multiple elements per cycle
     • Vector load-store units (LSUs)
       – Fully pipelined unit to load or store a vector
       – Multiple elements fetched/stored per cycle
       – May have multiple LSUs
     • Cross-bar to connect FUs, LSUs, registers

     Cray-1 Block Diagram
     • Simple 16-bit RR instr
     • 32-bit with immed
     • Natural combinations of scalar and vector
     • Scalar bit-vectors match vector length
     • Gather/scatter M-R
     • Cond. merge

     Basic Vector Instructions
     Instr.    Operands   Operation               Comment
     VADD.VV   V1,V2,V3   V1=V2+V3                vector + vector
     VADD.SV   V1,R0,V2   V1=R0+V2                scalar + vector
     VMUL.VV   V1,V2,V3   V1=V2xV3                vector x vector
     VMUL.SV   V1,R0,V2   V1=R0xV2                scalar x vector
     VLD       V1,R1      V1=M[R1..R1+63]         load, stride=1
     VLDS      V1,R1,R2   V1=M[R1..R1+63*R2]      load, stride=R2
     VLDX      V1,R1,V2   V1=M[R1+V2i,i=0..63]    indexed ("gather")
     VST       V1,R1      M[R1..R1+63]=V1         store, stride=1
     VSTS      V1,R1,R2   M[R1..R1+63*R2]=V1      store, stride=R2
     VSTX      V1,R1,V2   M[R1+V2i,i=0..63]=V1    indexed ("scatter")
     + all the regular scalar instructions (RISC style)…

     Vector Memory Operations
     • Load/store operations move groups of data between registers and memory
     • Three types of addressing
       – Unit stride
         • Fastest
       – Non-unit (constant) stride
       – Indexed (gather-scatter)
         • Vector equivalent of register indirect
         • Good for sparse arrays of data
         • Increases number of programs that vectorize
     • compress/expand variant also
     • Support for various combinations of data widths in memory
       – {.L,.W,.H,.B} x {64b, 32b, 16b, 8b}

     Vector Code Example
     Y[0:63] = Y[0:63] + a*X[0:63]

     64 element SAXPY: scalar
           LD    R0,a
           ADDI  R4,Rx,#512     #end address of X (64 elements x 8 bytes)
     loop: LD    R2,0(Rx)       #load X(i)
           MULTD R2,R0,R2       #a*X(i)
           LD    R6,0(Ry)       #load Y(i) into R6 so the bound in R4 survives
           ADDD  R6,R2,R6       #a*X(i)+Y(i)
           SD    R6,0(Ry)       #store Y(i)
           ADDI  Rx,Rx,#8
           ADDI  Ry,Ry,#8
           SUB   R20,R4,Rx      #zero once Rx reaches the end address
           BNZ   R20,loop

     64 element SAXPY: vector
           LD      R0,a         #load scalar a
           VLD     V1,Rx        #load vector X
           VMUL.SV V2,R0,V1     #vector mult
           VLD     V3,Ry        #load vector Y
           VADD.VV V4,V2,V3     #vector add
           VST     V4,Ry        #store vector Y

     Vector Length
     • A vector register can hold some maximum number of elements for each data width (maximum vector length or MVL)
     • What to do when the application vector length is not exactly MVL?
     • Vector-length (VL) register controls the length of any vector operation, including a vector load or store
       – E.g. vadd.vv with VL=10 is
           for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
     • VL can be anything from 0 to MVL
     • How do you code an application where the vector length is not known until run-time?
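The three addressing modes from the Vector Memory Operations slide, and the way the VL register bounds each operation, can be sketched in Python. This is a toy model under stated assumptions, not how any real machine implements them: memory is a plain list addressed in elements rather than bytes, and the function names `vld`, `vlds`, `vldx`, and `vstx` only mirror the mnemonics in the instruction table.

```python
MVL = 64  # assumed maximum vector length, matching the 64-element examples


def vld(mem, base, vl=MVL):
    """Unit-stride load: V[i] = M[base + i] for i < vl."""
    return [mem[base + i] for i in range(vl)]


def vlds(mem, base, stride, vl=MVL):
    """Constant-stride load: V[i] = M[base + i*stride] for i < vl."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index, vl=MVL):
    """Indexed load (gather): V[i] = M[base + index[i]] for i < vl."""
    return [mem[base + index[i]] for i in range(vl)]


def vstx(mem, base, index, vec, vl=MVL):
    """Indexed store (scatter): M[base + index[i]] = V[i] for i < vl."""
    for i in range(vl):
        mem[base + index[i]] = vec[i]


# Toy memory (hypothetical data): element k holds the value 100 + k.
mem = list(range(100, 132))

row = vld(mem, 4, vl=4)                  # four consecutive elements
col = vlds(mem, 0, 8, vl=4)              # every 8th element, e.g. a matrix column
picked = vldx(mem, 16, [3, 0, 2], vl=3)  # gather through an index vector
vstx(mem, 0, [5, 1], [-1, -2], vl=2)     # scatter two values back to memory
```

Passing `vl` explicitly plays the role of the VL register: the same operation touches anywhere from 0 to MVL elements without changing the instruction itself.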

  3. Strip Mining
     • Suppose application vector length > MVL
     • Strip mining
       – Generation of a loop that handles MVL elements per iteration
       – A set of operations on MVL elements is translated to a single vector instruction
     • Example: vector saxpy of N elements
       – First loop handles (N mod MVL) elements, the rest handle MVL

           VL = (N mod MVL);        // set VL = N mod MVL
           for (I=0; I<VL; I++)     // 1st loop is a single set of
             Y[I]=A*X[I]+Y[I];      // vector instructions
           low = (N mod MVL);
           VL = MVL;                // set VL to MVL
           for (I=low; I<N; I++)    // 2nd loop requires N/MVL
             Y[I]=A*X[I]+Y[I];      // sets of vector instructions

     Optimization 1: Chaining
     • Suppose:
           vmul.vv V1,V2,V3
           vadd.vv V4,V1,V5   # RAW hazard
     • Chaining
       – Vector register (V1) is treated not as a single entity but as a group of individual registers
       – Pipeline forwarding can work on individual vector elements
     • Flexible chaining: allow vector to chain to any other active vector operation => more read/write ports
     • Cray X-MP introduces memory chaining
     [Figure: timelines of vmul followed by vadd, unchained vs. chained]

     Chaining & Multi-lane Example
     [Figure: instruction issue and element operations over time across Scalar, LSU, FU0, and FU1, running vld, vmul.vv, vadd.vv, addu]
     • VL=16, 4 lanes, 2 FUs, 1 LSU, chaining -> 12 ops/cycle
     • Just one new instruction issued per cycle !!!!

     Optimization 2: Multi-lane Implementation
     [Figure: one lane, made of a pipelined datapath, a vector register partition, and a functional unit, connected to/from the memory system]
     • Elements for vector registers interleaved across the lanes
     • Each lane receives identical control
     • Multiple element operations executed per cycle
     • Modular, scalable design
     • No need for inter-lane communication for most vector instructions

     Two Ways to View Vectorization
     • Inner loop vectorization (Classic approach)
       – Think of machine as, say, 32 vector registers each with 16 elements
       – 1 instruction updates 16 elements of 1 vector register
       – Good for vectorizing single-dimension arrays or regular kernels (e.g. saxpy)
     • Outer loop vectorization (post-CM2)
       – Think of machine as 16 "virtual processors" (VPs) each with 32 scalar registers! (~ multithreaded processor)
       – 1 instruction updates 1 scalar register in 16 VPs
       – Good for irregular kernels or kernels with loop-carried dependences in the inner loop
     • These are just two compiler perspectives
       – The hardware is the same for both

     Optimization 3: Conditional Execution
     • Suppose you want to vectorize this:
           for (I=0; I<N; I++)
             if (A[I]!= B[I]) A[I] -= B[I];
     • Solution: vector conditional execution
       – Add vector flag registers with single-bit elements
       – Use a vector compare to set a flag register
       – Use flag register as mask control for the vector sub
         • Subtraction executed only for vector elements with corresponding flag element set
     • Vector code
           vld         V1, Ra
           vld         V2, Rb
           vcmp.neq.vv F0, V1, V2       # vector compare
           vsub.vv     V1, V1, V2, F0   # conditional vsub: V1 = V1 - V2 where F0 is set
           vst         V1, Ra
       – Cray uses vector mask & merge
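Strip mining and the masked subtract from this page can be combined in one small Python sketch. It is a software model only, under stated assumptions: list slices stand in for vector registers, the helper names `strips`, `saxpy`, and `masked_sub` are made up for illustration, and MVL = 64 is chosen to match the earlier 64-element examples.

```python
MVL = 64  # assumed maximum vector length


def strips(n, mvl=MVL):
    """Yield (start, vl) pairs: first a strip of n mod mvl elements, then full strips."""
    start = 0
    vl = n % mvl
    if vl:
        yield start, vl
        start = vl
    while start < n:
        yield start, mvl
        start += mvl


def saxpy(a, x, y, mvl=MVL):
    """Y = a*X + Y, processed one strip at a time."""
    for start, vl in strips(len(x), mvl):
        # Each slice update stands in for vld / vmul.sv / vadd.vv / vst with VL = vl.
        xs, ys = x[start:start + vl], y[start:start + vl]
        y[start:start + vl] = [a * xi + yi for xi, yi in zip(xs, ys)]
    return y


def masked_sub(a, b, mvl=MVL):
    """A[i] -= B[i] wherever A[i] != B[i], using a per-strip flag vector as the mask."""
    for start, vl in strips(len(a), mvl):
        va, vb = a[start:start + vl], b[start:start + vl]
        flags = [ai != bi for ai, bi in zip(va, vb)]      # models vcmp.neq.vv
        a[start:start + vl] = [ai - bi if f else ai       # models the masked vsub.vv
                               for ai, bi, f in zip(va, vb, flags)]
    return a
```

For example, `list(strips(10, 4))` gives one short strip of 2 elements followed by two full strips of 4, which is the "first loop handles N mod MVL" scheme from the Strip Mining slide. Elements whose flag is clear keep their old value, which is why the model writes the result back over `a` rather than into a fresh vector.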
