CS184a: Computer Architecture (Structures and Organization), Day 19: Specialization


  1. CS184a: Computer Architecture (Structures and Organization)
     Day 19: November 27, 2000
     Specialization
     Caltech CS184a Fall2000 -- DeHon

     Previously
     • How to support bit processing operations
     • Generalizing tasks

  2. Today
     • What bit operations do I need to perform?
     • Specialization
       – Binding Time
       – Specialization Time Models
       – Specialization Benefits
       – Expression

     Quote
     • The fastest instructions you can execute are the ones you don’t.

  3. Idea
     • Minimize computation
     • Instantaneous computing requirements are less than the general case
     • Some data known or predictable
       – compute minimum computational residue
     • Dual of the generalization we saw for local control

     Opportunity Exists
     • Spatial unfolding of computation
       – can afford more specificity of operation
     • Fold (early) bound data into problem
     • Common/exceptional cases

  4. Opportunity
     • Arises for programmables
       – can change their instantaneous implementation
       – don’t have to cover all cases with a configuration
       – can be heavily specialized
         • while still capable of solving the entire problem
           – (all problems, all cases)

     Opportunity
     • With bit-level control
       – larger space of optimization than word level
     • When branching is costly
       – more important to exploit restricted/simplified cases
     • While true for both spatial and temporal programmables
       – bigger effect/benefits for spatial

  5. Multiply Example

     Multiply Show
     • Specialization in datapath width
     • Specialization in data

  6. Typical Optimization
     • Once we know another piece of information about a computation
       – data value, parameter, usage limit
     • Fold it into the computation
       – producing a smaller computational residue

     Benefits: Empirical Examples

  7. Benefit Examples
     • UART
     • Pattern match
     • Less than
     • Multiply revisited
       – more than just constant propagation
     • ATR

     UART
     • I8251: Intel (PC) standard UART
     • Many operating modes
       – bits
       – parity
       – sync/async
     • Runs in the same mode for the length of a connection

  8. UART FSMs

     UART Composite

  9. Pattern Match
     • Savings:
       – 2N-bit input computation → N-bit
       – if N is variable, maybe trim unneeded bits
       – state elements that store the target
       – control to load the target

     Pattern Match
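The pattern-match savings above can be sketched in software. This is only an illustration under stated assumptions: `generic_match` and `make_specialized_match` are hypothetical names, and a Python closure stands in for a reconfigured circuit. The generic matcher takes both the data word and the target (2N input bits); the specialized one has the target folded in and depends on the N data bits alone.

```python
def generic_match(data: int, target: int, n: int) -> bool:
    """Generic matcher: data and target are both runtime inputs (2N bits)."""
    mask = (1 << n) - 1
    return (data & mask) == (target & mask)


def make_specialized_match(target: int, n: int):
    """Fold a known target into the matcher (hypothetical helper).

    The returned function takes only the N data bits; no storage or
    load path for the target remains at match time.
    """
    mask = (1 << n) - 1
    folded = target & mask  # bound early, once

    def match(data: int) -> bool:
        return (data & mask) == folded

    return match


match_a5 = make_specialized_match(0xA5, 8)
print(match_a5(0xA5), match_a5(0xA4))
```

The specialized matcher must agree with the generic one for every data value; only the binding time of the target differs.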

  10. Less Than
      • Area depends on target value
      • But all targets take less area than a generic comparison

      Multiply (revisited)
      • Specialization can be more than constant propagation
      • Naïve:
        – save product term generation
        – complexity scales with the number of 1’s in the constant input
      • Can do better by exploiting algebraic properties
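One way to see why the area of a specialized "less than" depends on the target value is to write the comparison against a constant c as a sum of terms, one per 1 bit of c: x < c exactly when, at some position where c has a 1, x has a 0 and all higher bits of x equal those of c. The sketch below assumes unsigned n-bit operands; `make_less_than` is a hypothetical helper, and the term count is only a rough proxy for area, not a figure from the slides.

```python
def make_less_than(c: int, n: int):
    """Build x < c for a fixed unsigned n-bit constant c.

    Returns the specialized predicate and the number of terms it uses,
    which equals the number of 1 bits in c.
    """
    one_positions = [i for i in range(n) if (c >> i) & 1]

    def less_than(x: int) -> bool:
        for i in one_positions:
            bit_is_zero = ((x >> i) & 1) == 0
            higher_bits_equal = (x >> (i + 1)) == (c >> (i + 1))
            if bit_is_zero and higher_bits_equal:
                return True
        return False

    return less_than, len(one_positions)


lt_10, terms_10 = make_less_than(0b1010, 4)
print(terms_10, lt_10(9), lt_10(10))
```

A constant with few 1 bits needs few terms, while a generic comparator has to handle every possible target.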

  11. Multiply
      • Never really need more than N/2 one bits in the constant
      • If more than N/2 ones:
        – invert c: $2^{N+1}-1-c$ (less than N/2 ones)
        – multiply by x: $(2^{N+1}-1-c)x$
        – add x: $(2^{N+1}-c)x$
        – subtract from $2^{N+1}x$: $cx$

      Multiply
      • At most N/2 + 2 adds for any constant
      • Exploiting common subexpressions can do better, e.g.:
        – c = 10101010
        – t1 = x + (x<<2)
        – t2 = (t1<<5) + (t1<<1)
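The constant-multiply steps above can be checked with a small sketch. The function names are hypothetical, and `n` here is an assumed word width with c < 2^n, so it plays the role of the slide's N+1 exponent. The sketch shows plain shift-and-add, the complement trick for constants with many 1 bits, and the common-subexpression example for c = 10101010 (decimal 170).

```python
def mul_shift_add(x: int, c: int) -> int:
    """Naive constant multiply: one shifted copy of x per 1 bit of c."""
    total = 0
    bit = 0
    while c >> bit:
        if (c >> bit) & 1:
            total += x << bit
        bit += 1
    return total


def mul_complement(x: int, c: int, n: int) -> int:
    """Multiply by c using its n-bit complement, which has fewer 1 bits
    when c has many: c*x = 2^n * x - ((2^n - 1 - c) * x + x)."""
    inverted = (1 << n) - 1 - c
    return (x << n) - (mul_shift_add(x, inverted) + x)


def mul_170(x: int) -> int:
    """c = 0b10101010 with a shared subexpression: two adds in total."""
    t1 = x + (x << 2)             # 5x
    return (t1 << 5) + (t1 << 1)  # 160x + 10x = 170x


print(mul_170(3), mul_complement(3, 0b11101111, 8))
```

Shift-and-add for 170 needs three adds (four 1 bits), whereas reusing `t1` needs two.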

  12. Multiply

      Example: ATR
      • Automatic Target Recognition
        – need to score image for a number of different patterns
          • different views of tanks, missiles, etc.
        – reduce target image to a binary template with don’t cares
        – need to track many (e.g. 70-100) templates for each image region
        – templates themselves are sparse
          • small fraction of care pixels

  13. Example: ATR
      • 16x16x2=512 flops to hold single target
      • 16x16=256 LUTs to compute match
      • 256 score bits → 8b score: ~500 adder bits in tree
      • more for retiming
      • ~800 LUTs here
      • Maybe fit 1 generic pattern template in XC4010 (400 CLBs)?

      Example: UCLA ATR
      • UCLA
        – specialize to template
        – ignore don’t care pixels
        – only build adder tree to care pixels
        – exploit common subexpressions
        – get 10 templates in a XC4010
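The difference between a generic template matcher and one specialized to a template can be sketched as follows. This is an illustrative software analogue, not the UCLA implementation: the image, template, and care mask are assumed to be equal-sized lists of 0/1 rows, and `specialize_template` is a hypothetical helper. The generic scorer visits every pixel; the specialized one keeps only the care pixels, so its work scales with the number of care pixels rather than with the template size.

```python
def generic_score(image, template, care):
    """Generic scorer: template and care mask are runtime data."""
    rows, cols = len(image), len(image[0])
    return sum(
        1
        for r in range(rows)
        for c in range(cols)
        if care[r][c] and image[r][c] == template[r][c]
    )


def specialize_template(template, care):
    """Fold a known template into the scorer, dropping don't-care pixels."""
    care_pixels = [
        (r, c, template[r][c])
        for r in range(len(template))
        for c in range(len(template[0]))
        if care[r][c]
    ]

    def score(image):
        return sum(1 for r, c, v in care_pixels if image[r][c] == v)

    return score, len(care_pixels)


template = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
care = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
image = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
score, n_care = specialize_template(template, care)
print(n_care, score(image))
```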

  14. Example: FIR Filtering
      $Y_i = w_1 x_i + w_2 x_{i+1} + \ldots$
      Application metric: TAPs = filter taps multiply accumulate

      Usage Classes
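As a minimal sketch of the FIR equation above, the code below computes the sum of w_k times x_{i+k} over a window, first generically and then with the weights bound early. The names `fir` and `make_fir` are assumptions for illustration; dropping zero weights is just one simple form of specialization and is not taken from the slides.

```python
def fir(x, w):
    """Generic FIR: the weights are runtime inputs."""
    taps = len(w)
    return [
        sum(w[k] * x[i + k] for k in range(taps))
        for i in range(len(x) - taps + 1)
    ]


def make_fir(w):
    """Bind the weights once; zero-weight taps are dropped entirely."""
    taps = len(w)
    nonzero = [(k, wk) for k, wk in enumerate(w) if wk != 0]

    def run(x):
        return [
            sum(wk * x[i + k] for k, wk in nonzero)
            for i in range(len(x) - taps + 1)
        ]

    return run


fir_102 = make_fir([1, 0, 2])
print(fir_102([1, 2, 3, 4]))
```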

  15. Specialization Usage Classes
      • Known binding time
      • Dynamic binding, persistent use
        – apparent
        – empirical
      • Common case

      Known Binding Time
      • Sum=0
      • For I=0 → N
        – Sum+=V[I]
      • For I=0 → N
        – VN[I]=V[I]/Sum

      • Scale(max,min,V)
        – for I=0 → V.length
          • tmp=(V[I]-min)
          • Vres[I]=tmp/(max-min)
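In both known-binding-time examples above, a value is fixed before the loop that uses it (`Sum` after the first loop, `max` and `min` at the call to `Scale`). A hedged sketch of exploiting that: bind the value once, derive what depends on it, and leave a smaller residue in the loop. The function names are hypothetical, and replacing the division by a precomputed reciprocal is one assumed choice of specialization.

```python
def scale_generic(vmax, vmin, v):
    """Generic version: recomputes (vmax - vmin) and divides per element."""
    return [(x - vmin) / (vmax - vmin) for x in v]


def make_scale(vmax, vmin):
    """vmax and vmin are bound before the loop, so fold them in once."""
    inv_range = 1.0 / (vmax - vmin)  # computed once at binding time

    def scale(v):
        return [(x - vmin) * inv_range for x in v]

    return scale


def normalize(v):
    """Sum is known once the first loop ends; bind it for the second."""
    total = sum(v)
    inv_total = 1.0 / total
    return [x * inv_total for x in v]


scale_0_8 = make_scale(8.0, 0.0)
print(scale_0_8([2.0, 4.0]), normalize([1.0, 3.0]))
```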

  16. Dynamic Binding Time
      • cexp=0;
      • For I=0 → V.length
        – if (V[I].exp!=cexp)
          • cexp=V[I].exp;
        – Vres[I]=V[I].mant<<cexp

      • Thread 1:
        – a=src.read()
        – if (a.newavg())
          • avg=a.avg()
      • Thread 2:
        – v=data.read()
        – out.write(v/avg)

      Empirical Binding
      • Have to check if value changed
        – Checking value O(N) area [pattern match]
        – Interesting because computations
          • can be $O(2^N)$ [Day 8]
          • often greater area than pattern match
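The dynamic-binding loop above keeps using `cexp` until the exponent changes. A minimal sketch of that pattern follows: check on every use whether the bound value changed (the cheap pattern match), and rebuild the specialized kernel only when it did. `RespecializingShifter` is a hypothetical class, and a lambda stands in for a specialized datapath.

```python
class RespecializingShifter:
    """Keeps a shift kernel specialized to the most recently seen exponent."""

    def __init__(self):
        self.bound_exp = None   # value the current kernel is specialized to
        self.kernel = None
        self.rebuilds = 0       # how often we had to respecialize

    def _build(self, exp):
        self.rebuilds += 1
        return lambda mant: mant << exp  # exp is folded into the kernel

    def apply(self, mant, exp):
        if exp != self.bound_exp:        # check whether the value changed
            self.bound_exp = exp
            self.kernel = self._build(exp)
        return self.kernel(mant)


shifter = RespecializingShifter()
samples = [(3, 2), (5, 2), (1, 2), (7, 4)]
results = [shifter.apply(m, e) for m, e in samples]
print(results, shifter.rebuilds)
```

The approach pays off when the bound value persists across many uses, so that rebuilds are rare compared with applications.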

  17. Common/Exceptional Case
      • For I=0 → N
        – Sum+=V[I]
        – delta=V[I]-V[I-1]
        – SumSq+=V[I]*V[I]
        – ….
        – if (overflow)
          • ….

      • For IB=0 → N/B
        – For II=0 → B
          • I=II+IB
          • Sum+=V[I]
          • delta=V[I]-V[I-1]
          • SumSq+=V[I]*V[I]
          • ….
        – if (overflow)
          • ….

      Binding Times
      • Pre-fabrication
      • Application/algorithm selection
      • Compilation
      • Installation
      • Program startup (load time)
      • Instantiation (new ...)
      • Epochs
      • Procedure
      • Loop
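The two loop shapes above differ in where the exceptional-case check sits: inside every iteration, or once per block so the common-case inner loop stays free of it. The sketch below illustrates this under explicit assumptions: inputs are non-negative (so the running sum never decreases and a block-end check cannot miss an overflow), `LIMIT` is an assumed 32-bit signed bound, and the blocks are stepped by a block-sized stride.

```python
LIMIT = 2**31 - 1  # assumed overflow threshold for illustration


def sum_checked(v):
    """Check for the exceptional case on every iteration."""
    total, overflowed = 0, False
    for x in v:
        total += x
        if total > LIMIT:
            overflowed = True
    return total, overflowed


def sum_blocked(v, block):
    """Common case runs unchecked inside a block; check once per block."""
    total, overflowed = 0, False
    for start in range(0, len(v), block):
        for x in v[start:start + block]:   # common-case inner loop
            total += x
        if total > LIMIT:                  # exceptional case, outside it
            overflowed = True
    return total, overflowed


data = [2**30, 2**30, 5, 7, 11]
print(sum_checked(data), sum_blocked(data, 2))
```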

  18. Exploitation Models
      • Full Specialization
      • Worst-case pre-allocation
        – e.g. multiplier worst-case, avg., this case
      • Range specialization
        – data width
      • Template / placeholder

      Opportunity Example

  19. Bit Constancy Lattice
      • binding time for bits of variables (storage-based)
        – CBD: Constant between definitions
        – SCBD: + signed
        – CSSI: Constant in some scope invocations
        – SCSSI: + signed
        – CESI: Constant in each scope invocation
        – SCESI: + signed
        – CASI: Constant across scope invocations
        – SCASI: + signed
        – CAPI: Constant across program invocations
        – const: declared const
      [Experiment: Eylon Caspi/UCB]

      Experiments
      • Applications:
        – UCLA MediaBench: adpcm, epic, g721, gsm, jpeg, mesa, mpeg2 (not shown today: ghostscript, pegwit, pgp, rasta)
        – gzip, versatility, SPECint95 (parts)
      • Compiler optimize → instrument for profiling → run
      • Analyze variable usage, ignore heap
        – heap reads typically 0-10% of all bit-reads
        – 90-10 rule (variables): ~90% of bit reads in 1-20% of bits
      [Experiment: Eylon Caspi/UCB]

  20. Empirical Bit-Reads Classification
      [Pie chart: Bit-Read Classification - Variables (MediaBench, averaged per program). Slices: const 0.3%, CBD 15%, SCASI 40%; the remaining slices (SCBD, CSSI, SCSSI, CESI, SCESI, CASI) carry the values 13%, 7%, 7%, 5%, 11%, and 2%.]
      [Experiment: Eylon Caspi/UCB]

      Bit-Reads Classification
      • regular across programs
        – SCASI, CASI, CBD stddev ~11%
      • nearly no activity in variables declared const
      • ~65% in constant + signed bits
        – trivially exploited
      [Experiment: Eylon Caspi/UCB]

  21. Constant Bit-Ranges
      • 32b data paths are too wide
      • 55% of all bit-reads are to sign-bits
      • most CASI reads clustered in bit-ranges (10% of 11%)
      • CASI+SCASI reads (50%) are positioned:
        – 2% low-order
        – 8% whole-word constant
        – 39% high-order
        – 1% elsewhere
      [Experiment: Eylon Caspi/UCB]

      Issue Roundup
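To make the notions of constant bits and sign bits concrete, here is a small sketch of how a profiler might classify the bits of one variable from a trace of its observed values. It is not the instrumentation used in the experiment: both function names are hypothetical, and the trace is assumed to be a non-empty list of integers interpreted at a given width.

```python
def constant_bit_mask(values, width=32):
    """Mask of bit positions that held the same value in every observation."""
    mask = (1 << width) - 1
    first = values[0] & mask
    changed = 0
    for v in values:
        changed |= (v & mask) ^ first
    return ~changed & mask


def sign_extension_bits(value, width=32):
    """Count high-order bits below the sign bit that merely replicate it."""
    v = value & ((1 << width) - 1)
    sign = (v >> (width - 1)) & 1
    count = 0
    for i in range(width - 2, -1, -1):
        if ((v >> i) & 1) == sign:
            count += 1
        else:
            break
    return count


trace = [0x12, 0x13, 0x10]
print(hex(constant_bit_mask(trace, 8)), sign_extension_bits(5, 8))
```

In this trace only the two low-order bits vary, which is the kind of high-order constancy the slide reports for most CASI+SCASI reads.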

  22. Expressing
      • Generators
      • Instantiation (disallow mutation once created)
      • Special methods (only allow mutation with)
      • Data Flow (binding time apparent)
      • Control Flow
        – (explicitly separate common/uncommon case)
      • Empirical discovery

      Benefits
      • Much of the benefit comes from reduced area
        – reduced area
          • room for more spatial operation
          • maybe less interconnect delay
      • Fully exploiting, full specialization
        – don’t know how big a block is until we see values
        – dynamic resource scheduling (next quarter?)
