
CS252 Graduate Computer Architecture Lecture 20

Vector Processing => Multimedia

David E. Culler

Many slides due to Christoforos E. Kozyrakis

CS252/Culler Lec 20.2 4/9/02

Vector Processors

  • Initially developed for super-computing applications, today important for multimedia.
  • Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2

VECTOR (N operations, N = vector length): vadd.vv v3, v1, v2 computes v3 = v1 + v2 element by element

Properties of Vector Processors

  • Single vector instruction implies lots of work (loop)
    – Fewer instruction fetches
  • Each result independent of previous result
    – Multiple operations can be executed in parallel
    – Simpler design, high clock rate
    – Compiler (programmer) ensures no dependencies
  • Reduces branches and branch problems in pipelines
  • Vector instructions access memory with known pattern
    – Effective prefetching
    – Amortize memory latency over a large number of elements
    – Can exploit a high-bandwidth memory system
    – No (data) caches required!

Styles of Vector Architectures

  • Memory-memory vector processors
    – All vector operations are memory to memory
  • Vector-register processors
    – All vector operations between vector registers (except vector load and store)
    – Vector equivalent of load-store architectures
    – Includes all vector machines since late 1980s
    – We assume vector-register for rest of the lecture

Historical Perspective

  • Mid-60s: fear that performance stagnates
  • SIMD processor arrays actively developed during late 60s to mid 70s
    – bit-parallel machines for image processing
        • PEPE, STARAN, MPP
    – word-parallel for scientific
        • Illiac IV
  • Cray develops fast scalar
    – CDC 6600, 7600
  • CDC bets on vectors with Star-100
  • Amdahl argues against vector

Cray-1 Breakthrough

  • Fast, simple scalar processor
    – 80 MHz!
    – single-phase, latches
  • Exquisite electrical and mechanical design
  • Semiconductor memory
  • Vector register concept
    – vast simplification of instruction set
    – reduced necessary memory bandwidth
  • Tight integration of vector and scalar
  • Piggy-back off 7600 stacklib
  • Later vectorizing compilers developed
  • Owned high-performance computing for a decade
    – what happened then?
    – VLIW competition

Components of a Vector Processor

  • Scalar CPU: registers, datapaths, instruction fetch logic
  • Vector register
    – Fixed-length memory bank holding a single vector
    – Typically 8-32 vector registers, each holding 1 to 8 Kbits
    – Has at least 2 read and 1 write ports
    – MM: Can be viewed as array of 64b, 32b, 16b, or 8b elements
  • Vector functional units (FUs)
    – Fully pipelined, start new operation every clock
    – Typically 2 to 8 FUs: integer and FP
    – Multiple datapaths (pipelines) used for each unit to process multiple elements per cycle
  • Vector load-store units (LSUs)
    – Fully pipelined unit to load or store a vector
    – Multiple elements fetched/stored per cycle
    – May have multiple LSUs
  • Cross-bar to connect FUs, LSUs, registers

Cray-1 Block Diagram

  • Simple 16-bit RR instr
  • 32-bit with immed
  • Natural combinations of scalar and vector
  • Scalar bit-vectors match vector length
  • Gather/scatter M-R
  • Cond. merge

Basic Vector Instructions

Instr.    Operands   Operation              Comment
VADD.VV   V1,V2,V3   V1=V2+V3               vector + vector
VADD.SV   V1,R0,V2   V1=R0+V2               scalar + vector
VMUL.VV   V1,V2,V3   V1=V2xV3               vector x vector
VMUL.SV   V1,R0,V2   V1=R0xV2               scalar x vector
VLD       V1,R1      V1=M[R1..R1+63]        load, stride=1
VLDS      V1,R1,R2   V1=M[R1..R1+63*R2]     load, stride=R2
VLDX      V1,R1,V2   V1=M[R1+V2i,i=0..63]   indexed ("gather")
VST       V1,R1      M[R1..R1+63]=V1        store, stride=1
VSTS      V1,R1,R2   M[R1..R1+63*R2]=V1     store, stride=R2
VSTX      V1,R1,V2   M[R1+V2i,i=0..63]=V1   indexed ("scatter")

+ all the regular scalar instructions (RISC style)…

Vector Memory Operations

  • Load/store operations move groups of data between registers and memory
  • Three types of addressing
    – Unit stride
        • Fastest
    – Non-unit (constant) stride
    – Indexed (gather-scatter)
        • Vector equivalent of register indirect
        • Good for sparse arrays of data
        • Increases number of programs that vectorize
        • compress/expand variant also
  • Support for various combinations of data widths in memory
    – {.L,.W,.H,.B} x {64b, 32b, 16b, 8b}
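
The three addressing modes can be modelled with plain scalar loops. The following is only a sketch under stated assumptions: memory is an array of doubles addressed by element index rather than by byte, and the function names (`vld_unit`, `vld_strided`, `vld_indexed`, `vst_indexed`) are hypothetical, not part of any real ISA or library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar models of vector memory addressing modes.
   Memory is an array of doubles; addresses are element indices. */

/* Unit stride: v[i] = mem[base + i] */
void vld_unit(double *v, const double *mem, size_t base, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        v[i] = mem[base + i];
}

/* Constant stride: v[i] = mem[base + i * stride] */
void vld_strided(double *v, const double *mem, size_t base,
                 size_t stride, size_t vl) {
    assert(stride > 0);
    for (size_t i = 0; i < vl; i++)
        v[i] = mem[base + i * stride];
}

/* Indexed load (gather): v[i] = mem[base + idx[i]] */
void vld_indexed(double *v, const double *mem, size_t base,
                 const size_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        v[i] = mem[base + idx[i]];
}

/* Indexed store (scatter): mem[base + idx[i]] = v[i] */
void vst_indexed(const double *v, double *mem, size_t base,
                 const size_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        mem[base + idx[i]] = v[i];
}
```

Unit stride touches consecutive elements, which is why it is the fastest mode; the indexed forms take their offsets from a second vector, which is what makes sparse data vectorizable.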

Vector Code Example

64 element SAXPY: scalar

      LD    R0,a
      ADDI  R3,Rx,#512     # end address of X (64 elements x 8 bytes)
loop: LD    R2,0(Rx)
      MULTD R2,R0,R2
      LD    R4,0(Ry)
      ADDD  R4,R2,R4
      SD    R4,0(Ry)
      ADDI  Rx,Rx,#8
      ADDI  Ry,Ry,#8
      SUB   R20,R3,Rx
      BNZ   R20,loop

64 element SAXPY: vector

      LD      R0,a         #load scalar a
      VLD     V1,Rx        #load vector X
      VMUL.SV V2,R0,V1     #vector mult
      VLD     V3,Ry        #load vector Y
      VADD.VV V4,V2,V3     #vector add
      VST     V4,Ry        #store vector Y

Y[0:63] = Y[0:63] + a*X[0:63]

Vector Length

  • A vector register can hold some maximum number of elements for each data width (maximum vector length or MVL)
  • What to do when the application vector length is not exactly MVL?
  • Vector-length (VL) register controls the length of any vector operation, including a vector load or store
    – E.g. vadd.vv with VL=10 is for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
  • VL can be anything from 0 to MVL
  • How do you code an application where the vector length is not known until run-time?

Strip Mining

  • Suppose application vector length > MVL
  • Strip mining
    – Generation of a loop that handles MVL elements per iteration
    – A set of operations on MVL elements is translated to a single vector instruction
  • Example: vector saxpy of N elements
    – First loop handles (N mod MVL) elements, the rest handle MVL

    VL = (N mod MVL);            // set VL = N mod MVL
    for (I=0; I<VL; I++)         // 1st loop is a single set of
      Y[I]=A*X[I]+Y[I];          // vector instructions
    low = (N mod MVL);
    VL = MVL;                    // set VL to MVL
    for (I=low; I<N; I++)        // 2nd loop requires N/MVL
      Y[I]=A*X[I]+Y[I];          // sets of vector instructions
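
The strip-mining scheme can be sketched in C with the strips made explicit. This is an illustrative model, not compiler output: `MVL` is assumed to be 64, and `saxpy_vector_op` is a hypothetical stand-in for one set of vector instructions (VLD, VMUL.SV, VLD, VADD.VV, VST) executed with the VL register set to `vl`.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, in elements */

/* Models one set of vector instructions operating on vl <= MVL elements. */
static void saxpy_vector_op(double a, const double *x, double *y, size_t vl) {
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined SAXPY: the first strip handles n % MVL elements (if any),
   every later strip handles exactly MVL elements.
   Returns the number of strips, i.e. sets of vector instructions issued. */
size_t saxpy_strip_mined(size_t n, double a, const double *x, double *y) {
    size_t low = 0;
    size_t strips = 0;
    size_t vl = n % MVL;                 /* odd-sized first strip */

    if (vl > 0) {
        saxpy_vector_op(a, x, y, vl);
        low = vl;
        strips++;
    }
    for (; low < n; low += MVL) {        /* remaining full-length strips */
        saxpy_vector_op(a, x + low, y + low, MVL);
        strips++;
    }
    return strips;
}
```

For example, with n = 150 the first strip covers 22 elements and two further strips cover 64 elements each. Handling the odd-sized strip first means VL only has to be changed once, after the first strip.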

Optimization 1: Chaining

  • Suppose:

    vmul.vv V1,V2,V3
    vadd.vv V4,V1,V5   # RAW hazard

  • Chaining
    – Vector register (V1) is treated not as a single entity but as a group of individual registers
    – Pipeline forwarding can work on individual vector elements
  • Flexible chaining: allow vector to chain to any other active vector operation => more read/write ports

[Figure: vmul followed by vadd over time, unchained vs. chained]

Cray X-MP introduces memory chaining

Optimization 2: Multi-lane Implementation

  • Elements for vector registers interleaved across the lanes
  • Each lane receives identical control
  • Multiple element operations executed per cycle
  • Modular, scalable design
  • No need for inter-lane communication for most vector instructions

[Figure: a lane, containing a vector register partition and pipelined functional unit datapaths, connected to/from the memory system]

Chaining & Multi-lane Example

  • VL=16, 4 lanes, 2 FUs, 1 LSU, chaining -> 12 ops/cycle (4 lanes x (2 FUs + 1 LSU))
  • Just one new instruction issued per cycle!

[Figure: instruction issue (vld, vmul.vv, vadd.vv, addu, repeated) and the resulting element operations over time on the LSU, FU0, FU1 and scalar unit]

Optimization 3: Conditional Execution

  • Suppose you want to vectorize this:

    for (I=0; I<N; I++)
      if (A[I] != B[I]) A[I] -= B[I];

  • Solution: vector conditional execution
    – Add vector flag registers with single-bit elements
    – Use a vector compare to set a flag register
    – Use flag register as mask control for the vector sub
        • Subtraction executed only for vector elements with corresponding flag element set
  • Vector code

    vld         V1, Ra
    vld         V2, Rb
    vcmp.neq.vv F0, V1, V2       # vector compare
    vsub.vv     V1, V1, V2, F0   # conditional vsub
    vst         V1, Ra

  • Cray uses vector mask & merge
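
A scalar model may help make the mask semantics concrete. The sketch below assumes integer elements and uses hypothetical helper names (`vcmp_neq`, `vsub_masked`); real flag registers hold one bit per element, which is modelled here with a `bool` array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Models a vector compare: flag[i] = (v1[i] != v2[i]). */
void vcmp_neq(bool *flag, const int *v1, const int *v2, size_t vl) {
    assert(flag != NULL);
    for (size_t i = 0; i < vl; i++)
        flag[i] = (v1[i] != v2[i]);
}

/* Models a masked vector subtract: dst[i] = a[i] - b[i] only where
   flag[i] is set. Elements with a clear flag are left untouched. */
void vsub_masked(int *dst, const int *a, const int *b,
                 const bool *flag, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        if (flag[i])
            dst[i] = a[i] - b[i];
    }
}
```

Calling `vcmp_neq(f, A, B, n)` followed by `vsub_masked(A, A, B, f, n)` reproduces the scalar loop. The branch inside `vsub_masked` exists only in this model; in hardware the flag acts as a per-element write enable, so no branch is executed.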

Two Ways to View Vectorization

  • Inner loop vectorization (Classic approach)
    – Think of machine as, say, 32 vector registers each with 16 elements
    – 1 instruction updates 16 elements of 1 vector register
    – Good for vectorizing single-dimension arrays or regular kernels (e.g. saxpy)
  • Outer loop vectorization (post-CM2)
    – Think of machine as 16 “virtual processors” (VPs) each with 32 scalar registers! (similar to a multithreaded processor)
    – 1 instruction updates 1 scalar register in 16 VPs
    – Good for irregular kernels or kernels with loop-carried dependences in the inner loop
  • These are just two compiler perspectives
    – The hardware is the same for both

Vectorizing Matrix Mult

    // Matrix-matrix multiply:
    // sum a[i][t] * b[t][j] to get c[i][j]
    for (i=1; i<n; i++) {
      for (j=1; j<n; j++) {
        sum = 0;
        for (t=1; t<n; t++) {
          sum += a[i][t] * b[t][j];   // loop-carried
        }                             // dependence
        c[i][j] = sum;
      }
    }

Parallelize Inner Product

[Figure: parallel multiplies (*) feeding a tree of adds (+)]

Sum of Partial Products

Outer-loop Approach

    // Outer-loop Matrix-matrix multiply:
    // sum a[i][t] * b[t][j] to get c[i][j]
    // 32 elements of the result calculated in parallel
    // with each iteration of the j-loop (c[i][j:j+31])
    for (i=1; i<n; i++) {
      for (j=1; j<n; j+=32) {               // loop being vectorized
        sum[0:31] = 0;
        for (t=1; t<n; t++) {
          a_scalar = a[i][t];                   // scalar load
          b_vector[0:31] = b[t][j:j+31];        // vector load
          prod[0:31] = b_vector[0:31]*a_scalar; // vector mul
          sum[0:31] += prod[0:31];              // vector add
        }
        c[i][j:j+31] = sum[0:31];           // vector store
      }
    }
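
The array-slice pseudocode can be turned into plain C by writing each slice operation as a short loop over the strip. The version below is a sketch with several assumptions that differ from the slide: indices are 0-based, matrices are stored row-major in flat arrays, the strip width `VLEN` is 4 rather than 32, and n must be a multiple of `VLEN`.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed strip width; the slide uses 32 */

/* c = a * b for n x n row-major matrices, vectorizing the j loop.
   Each k loop below stands for a single vector instruction. */
void matmul_outer_loop(size_t n, const double *a, const double *b, double *c) {
    assert(n % VLEN == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j += VLEN) {       /* loop being vectorized */
            double sum[VLEN] = {0};
            for (size_t t = 0; t < n; t++) {
                double a_scalar = a[i * n + t];      /* scalar load */
                for (size_t k = 0; k < VLEN; k++)    /* vector load, mul, add */
                    sum[k] += b[t * n + j + k] * a_scalar;
            }
            for (size_t k = 0; k < VLEN; k++)        /* vector store */
                c[i * n + j + k] = sum[k];
        }
    }
}
```

The loop-carried dependence on the sum is still there, but it now runs along t within each of the `VLEN` independent accumulators, so the operations across k have no dependences between them.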

Approaches to Mediaprocessing

Multimedia Processing

  • General-purpose processors with SIMD extensions
  • Vector processors
  • VLIW with SIMD extensions (aka mediaprocessors)
  • DSPs
  • ASICs/FPGAs

What is Multimedia Processing?

  • Desktop:
    – 3D graphics (games)
    – Speech recognition (voice input)
    – Video/audio decoding (mpeg-mp3 playback)
  • Servers:
    – Video/audio encoding (video servers, IP telephony)
    – Digital libraries and media mining (video servers)
    – Computer animation, 3D modeling & rendering (movies)
  • Embedded:
    – 3D graphics (game consoles)
    – Video/audio decoding & encoding (set top boxes)
    – Image processing (digital cameras)
    – Signal processing (cellular phones)

The Need for Multimedia ISAs

  • Why aren’t general-purpose processors and ISAs sufficient for multimedia (despite Moore’s law)?
  • Performance
    – A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps
    – One 384Kbps W-CDMA channel requires 6.9 GOPS
  • Power consumption
    – A 1.2GHz Athlon consumes ~60W
    – Power consumption increases with clock frequency and complexity
  • Cost
    – A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module)
    – Cost increases with complexity, area, transistor count, power, etc.

Example: MPEG Decoding

Input Stream -> Parsing -> Dequantization -> IDCT -> Block Reconstruction -> RGB->YUV -> Output to Screen

Load Breakdown: 10%, 20%, 25%, 30%, 15%

Example: 3D Graphics

[Figure: Display Lists -> Geometry Pipe -> Rendering Pipe -> Output to Screen]

Stage labels: Transform, Lighting, Setup, Rasterization, Anti-aliasing, Shading, fogging, Texture mapping, Alpha blending, Z-buffer, Clipping, Frame-buffer ops

Load Breakdown: 10%, 10%, 35%, 55%

Characteristics of Multimedia Apps (1)

  • Requirement for real-time response
    – “Incorrect” result often preferred to slow result
    – Unpredictability can be bad (e.g. dynamic execution)
  • Narrow data-types
    – Typical width of data in memory: 8 to 16 bits
    – Typical width of data during computation: 16 to 32 bits
    – 64-bit data types rarely needed
    – Fixed-point arithmetic often replaces floating-point
  • Fine-grain (data) parallelism
    – Identical operation applied on streams of input data
    – Branches have high predictability
    – High instruction locality in small loops or kernels

Characteristics of Multimedia Apps (2)

  • Coarse-grain parallelism
    – Most apps organized as a pipeline of functions
    – Multiple threads of execution can be used
  • Memory requirements
    – High bandwidth requirements but can tolerate high latency
    – High spatial locality (predictable pattern) but low temporal locality
    – Cache bypassing and prefetching can be crucial

Examples of Media Functions

  • Matrix transpose/multiply (3D graphics)
  • DCT/FFT (Video, audio, communications)
  • Motion estimation (Video)
  • Gamma correction (3D graphics)
  • Haar transform (Media mining)
  • Median filter (Image processing)
  • Separable convolution (Image processing)
  • Viterbi decode (Communications, speech)
  • Bit packing (Communications, cryptography)
  • Galois-fields arithmetic (Communications, cryptography)

SIMD Extensions for GPP

  • Motivation
    – Low media-processing performance of GPPs
    – Cost and lack of flexibility of specialized ASICs for graphics/video
    – Underutilized datapaths and registers
  • Basic idea: sub-word parallelism
    – Treat a 64-bit register as a vector of 2 32-bit or 4 16-bit or 8 8-bit values (short vectors)
    – Partition 64-bit datapaths to handle multiple narrow operations in parallel
  • Initial constraints
    – No additional architecture state (registers)
    – No additional exceptions
    – Minimum area overhead
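
Sub-word parallelism can be imitated in portable C by treating a `uint64_t` as four 16-bit lanes. The sketch below is a software emulation for illustration only: a real SIMD extension partitions the adder in hardware, and the names `padd16` and `lane16` are hypothetical rather than intrinsics of any particular ISA.

```c
#include <assert.h>
#include <stdint.h>

/* Add four 16-bit lanes packed in a 64-bit word. Each lane wraps around
   independently and no carry propagates from one lane into the next. */
uint64_t padd16(uint64_t a, uint64_t b) {
    const uint64_t high = 0x8000800080008000ULL;  /* top bit of each lane */
    /* Add the low 15 bits of every lane; the carry out of bit 14 lands in
       bit 15 of the same lane, so it cannot cross a lane boundary. */
    uint64_t low_sum = (a & ~high) + (b & ~high);
    /* Fold in the top bits with XOR, which is addition without carry-out. */
    return low_sum ^ ((a ^ b) & high);
}

/* Extract lane i (0 = least significant) from a packed word. */
uint16_t lane16(uint64_t w, unsigned i) {
    assert(i < 4);
    return (uint16_t)(w >> (16 * i));
}
```

A plain 64-bit add would let the overflow of one lane corrupt its neighbour; cutting the carry chain at lane boundaries is exactly what a partitioned datapath does.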

Overview of SIMD Extensions

Vendor     Extension     Year    # Instr        Registers
HP         MAX-1 and 2   94,95   9,8 (int)      Int 32x64b
Sun        VIS           95      121 (int)      FP 32x64b
Intel      MMX           97      57 (int)       FP 8x64b
AMD        3DNow!        98      21 (fp)        FP 8x64b
Motorola   Altivec       98      162 (int,fp)   32x128b (new)
Intel      SSE           98      70 (fp)        8x128b (new)
MIPS       MIPS-3D       ?       23 (fp)        FP 32x64b
Intel      SSE-2         01      144 (int,fp)   8x128 (new)
AMD        E 3DNow!      99      24 (fp)        8x128 (new)

Summary of SIMD Operations (1)

  • Integer arithmetic
    – Addition and subtraction with saturation
    – Fixed-point rounding modes for multiply and shift
    – Sum of absolute differences
    – Multiply-add, multiplication with reduction
    – Min, max
  • Floating-point arithmetic
    – Packed floating-point operations
    – Square root, reciprocal
    – Exception masks
  • Data communication
    – Merge, insert, extract
    – Pack, unpack (width conversion)
    – Permute, shuffle
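
Two of the integer operations listed above are easy to misread, so here is a per-element sketch of what they compute. The function names are hypothetical and operate on one element (or a plain array) at a time; a SIMD instruction would apply the same arithmetic to every lane of a packed register at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned 8-bit add with saturation: clamps at 255 instead of wrapping. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Sum of absolute differences over n 8-bit samples,
   the core operation of block matching in motion estimation. */
unsigned sad_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    assert(a != NULL && b != NULL);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                               : (unsigned)(b[i] - a[i]);
    return total;
}
```

Saturation matters for pixel data: with wrap-around, brightening a pixel of value 200 by 100 would give 44 (dark) rather than 255 (white).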

Summary of SIMD Operations (2)

  • Comparisons
    – Integer and FP packed comparison
    – Compare absolute values
    – Element masks and bit vectors
  • Memory
    – No new load-store instructions for short vector
        • No support for strides or indexing
    – Short vectors handled with 64b load and store instructions
    – Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one
    – Prefetch instructions for utilizing temporal locality

Programming with SIMD Extensions

  • Optimized shared libraries
    – Written in assembly, distributed by vendor
    – Need well-defined API for data format and use
  • Language macros for variables and operations
    – C/C++ wrappers for short vector variables and function calls
    – Allows instruction scheduling and register allocation optimizations for specific processors
    – Lack of portability, non-standard
  • Compilers for SIMD extensions
    – No commercially available compiler so far
    – Problems
        • Language support for expressing fixed-point arithmetic and SIMD parallelism
        • Complicated model for loading/storing vectors
        • Frequent updates
  • Assembly coding

SIMD Performance

[Figure: speedup over base architecture for Berkeley Media Benchmarks (arithmetic mean and geometric mean) on Athlon, Alpha 21264, Pentium III, PowerPC G4 and UltraSparc IIi]

Limitations

  • Memory bandwidth
  • Overhead of handling alignment and data width adjustments

A Closer Look at MMX/SSE

  • Higher speedup for kernels with narrow data where 128b SSE instructions can be used
  • Lower speedup for those with irregular or strided accesses

[Figure: speedup over base architecture per kernel, Pentium III (500MHz) with MMX/SSE]

Choosing the Data Type Width

  • Alternatives for selecting the width of elements in a vector register (64b, 32b, 16b, 8b)
  • Separate instructions for each width
    – E.g. vadd64, vadd32, vadd16, vadd8
    – Popular with SIMD extensions for GPPs
    – Uses too many opcodes
  • Specify it in a control register
    – Virtual-processor width (VPW)
    – Updated only on width changes
  • NOTE
    – MVL increases when width (VPW) gets narrower
    – E.g. with 2Kbits for register, MVL is 32, 64, 128, 256 for 64-, 32-, 16-, 8-bit data respectively
    – Always pick the narrowest VPW needed by the application
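
The relationship between register size, VPW and MVL is a single division. A minimal sketch follows; the function name `mvl_for_width` is hypothetical, and the set of legal widths is assumed to be the four listed on the slide.

```c
#include <assert.h>

/* Maximum vector length (in elements) for a vector register of reg_bits
   bits when each element is vpw_bits wide. */
unsigned mvl_for_width(unsigned reg_bits, unsigned vpw_bits) {
    assert(vpw_bits == 8 || vpw_bits == 16 ||
           vpw_bits == 32 || vpw_bits == 64);
    assert(reg_bits % vpw_bits == 0);
    return reg_bits / vpw_bits;
}
```

With a 2 Kbit (2048-bit) register this reproduces the figures above: 32, 64, 128 and 256 elements for 64-, 32-, 16- and 8-bit data. Halving the element width doubles the work done per vector instruction, which is why the narrowest sufficient VPW is preferred.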

Other Features for Multimedia

  • Support for fixed-point arithmetic
    – Saturation, rounding-modes etc.
  • Permutation instructions of vector registers
    – For reductions and FFTs
    – Not general permutations (too expensive)
  • Example: permutation for reductions
    – Move 2nd half of a vector register into another one
    – Repeatedly use with vadd to execute reduction
    – Vector length halved after each step

[Figure: vector registers V0 and V1 with element positions 15, 16 and 63 marked]
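
The reduction recipe (move the upper half aside, add, halve the vector length, repeat) can be sketched as follows. This is a scalar model under stated assumptions: the vector length is a power of two, the input array is overwritten, and `vreduce_sum` is a hypothetical name. The pointer to the upper half stands in for the permutation that copies it into a second register.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the vl elements of v using log2(vl) vector adds.
   vl must be a power of two; v is overwritten with partial sums. */
double vreduce_sum(double *v, size_t vl) {
    assert(vl > 0 && (vl & (vl - 1)) == 0);
    while (vl > 1) {
        vl /= 2;                          /* vector length halved each step */
        const double *upper = v + vl;     /* "moved" second half */
        for (size_t i = 0; i < vl; i++)   /* one vadd of length vl */
            v[i] += upper[i];
    }
    return v[0];
}
```

For a 64-element register this takes 6 permute/add steps with lengths 32, 16, 8, 4, 2 and 1, instead of 63 dependent scalar additions.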

Designing a Vector Processor

  • Changes to scalar core
  • How to pick the maximum vector length?
  • How to pick the number of vector registers?
  • Context switch overhead?
  • Exception handling?
  • Masking and flag instructions?

Changes to Scalar Processor

  • Decode vector instructions
  • Send scalar registers to vector unit (vector-scalar ops)
  • Synchronization for results back from vector register, including exceptions
  • Things that don’t run in vector don’t have high ILP, so can make scalar CPU simple

How to Pick Max. Vector Length?

  • Vector length => Keep all VFUs busy:

    Vector length >= (# lanes) x (# VFUs) / (# vector instr. issued/cycle)

  • Notes:
    – Single instruction issue is always the simplest
    – Don’t forget you have to issue some scalar instructions as well
    – Cray gets mileage from VL <= word length

How to Pick # of Vector Registers?

  • More vector registers:
    – Reduces vector register “spills” (save/restore)
    – Aggressive scheduling of vector instructions: better compiling to take advantage of ILP
  • Fewer
    – Fewer bits in instruction format (usually 3 fields)
  • 32 vector registers are usually enough

Context Switch Overhead?

  • The vector register file holds a huge amount of architectural state
    – Too expensive to save and restore all on each context switch
    – Cray: exchange packet
  • Extra dirty bit per processor
    – If vector registers not written, don’t need to save on context switch
  • Extra valid bit per vector register, cleared on process start
    – Don’t need to restore on context switch until needed
  • Extra tip:
    – Save/restore vector state only if the new context needs to issue vector instructions

Exception Handling: Arithmetic

  • Arithmetic traps are hard
  • Precise interrupts => large performance loss
    – Multimedia applications don’t care much about arithmetic traps anyway
  • Alternative model
    – Store exception information in vector flag registers
    – A set flag bit indicates that the corresponding element operation caused an exception
    – Software inserts trap barrier instructions to check the flag bits as needed
    – IEEE floating point requires 5 flag registers (5 types of traps)

Exception Handling: Page Faults

  • Page faults must be precise
    – Instruction page faults not a problem
    – Data page faults harder
  • Option 1: Save/restore internal vector unit state
    – Freeze pipeline, (dump all vector state), fix fault, (restore state and) continue vector pipeline
  • Option 2: expand memory pipeline to check all addresses before sending to memory
    – Requires address and instruction buffers to avoid stalls during address checks
    – On a page fault, one only needs to save state in those buffers
    – Instructions that have cleared the buffer can be allowed to complete

Exception Handling: Interrupts

  • Interrupts due to external sources
    – I/O, timers etc.
  • Handled by the scalar core
  • Should the vector unit be interrupted?
    – Not immediately (no context switch)
    – Only if it causes an exception or the interrupt handler needs to execute a vector instruction

Vector Power Consumption

  • Can trade off parallelism for power
    – Power = C * Vdd^2 * f
    – If we double the lanes, peak performance doubles
    – Halving f restores peak performance but also allows halving of the Vdd
    – Power_new = (2C) * (Vdd/2)^2 * (f/2) = Power/4
  • Simpler logic
    – Replicated control for all lanes
    – No multiple issue or dynamic execution logic
  • Simpler to gate clocks
    – Each vector instruction explicitly describes all the resources it needs for a number of cycles
    – Conditional execution leads to further savings
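
The factor of four in the power trade-off follows from substituting the doubled capacitance (twice the lanes), halved supply voltage and halved frequency into the dynamic power formula. Written out, assuming the same symbols as above:

```latex
\begin{align*}
P &= C \, V_{dd}^{2} \, f \\
P_{\text{new}} &= (2C)\left(\frac{V_{dd}}{2}\right)^{2}\left(\frac{f}{2}\right)
  = 2C \cdot \frac{V_{dd}^{2}}{4} \cdot \frac{f}{2}
  = \frac{C \, V_{dd}^{2} \, f}{4}
  = \frac{P}{4}
\end{align*}
```

Peak performance is unchanged because twice as many lanes at half the clock rate complete the same number of element operations per second.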

Why Vectors for Multimedia?

  • Natural match to parallelism in multimedia
    – Vector operations with VL equal to the image or frame width
    – Easy to efficiently support vectors of narrow data types
  • High performance at low cost
    – Multiple ops/cycle while issuing 1 instr/cycle
    – Multiple ops/cycle at low power consumption
    – Structured access pattern for registers and memory
  • Scalable
    – Get higher performance by adding lanes without architecture modifications
  • Compact code size
    – Describe N operations with 1 short instruction (v. VLIW)
  • Predictable performance
    – No need for caches, no dynamic execution
  • Mature, developed compiler technology

A Vector Media-Processor: VIRAM

  • Technology: IBM SA-27E
    – 0.18 µm CMOS, 6 copper layers
  • 280 mm^2 die area
    – 158 mm^2 DRAM, 50 mm^2 logic
  • Transistor count: ~115M
    – 14 Mbytes DRAM
  • Power supply & consumption
    – 1.2V for logic, 1.8V for DRAM
    – 2W at 1.2V
  • Peak performance
    – 1.6/3.2/6.4 Gops (64/32/16b ops)
    – 3.2/6.4/12.8 Gops (with madd)
    – 1.6 Gflops (single-precision)
  • Designed by 5 graduate students

Performance Comparison

  • QCIF and CIF numbers are in clock cycles per frame
  • All other numbers are in clock cycles per pixel
  • MMX results assume no first level cache misses

                      VIRAM   MMX
iDCT                  0.75    3.75 (5.0x)
Color Conversion      0.78    8.00 (10.2x)
Image Convolution     1.23    5.49 (4.5x)
QCIF (176x144)        7.1M    33M (4.6x)
CIF (352x288)         28M     140M (5.0x)

FFT (1)

FFT (Floating-point, 1024 points)

Processor      Execution Time (usec)
VIRAM          36
Pathfinder-2   16.8
Wildstar       25
TigerSHARC     69
ADSP-21160     92
TMS320C6701    124.3

FFT (2)

FFT (Fixed-point, 256 points)

Processor      Execution Time (usec)
VIRAM          7.2
Pathfinder-1   8.1
Carmel         9
TigerSHARC     7.3
PPC 604E       87
Pentium        151

SIMD Summary

  • Narrow vector extensions for GPPs
    – 64b or 128b registers as vectors of 32b, 16b, and 8b elements
  • Based on sub-word parallelism and partitioned datapaths
  • Instructions
    – Packed fixed- and floating-point, multiply-add, reductions
    – Pack, unpack, permutations
    – Limited memory support
  • 2x to 4x performance improvement over base architecture
    – Limited by memory bandwidth
  • Difficult to use (no compilers)

Vector Summary

  • Alternative model for explicitly expressing data parallelism
  • If code is vectorizable, then simpler hardware, more power efficient, and better real-time model than out-of-order machines with SIMD support
  • Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
  • Will multimedia popularity revive vector architectures?