  1. Trace Caches and optimizations therein CSE 240C - Rushi Chakrabarti - Winter 2009

  2. Trace Caches [Rotenberg’96]

  3. Trace Caches For those not in the know: • an I$ that captures dynamic instruction sequences • a trace is up to n instructions (cache line size) or m basic blocks (branch predictor throughput) • plus a starting address

  4. Trace Caches • valid bit - is the trace valid? • tag - starting address • branch flags - predictor bits • mask - is the last instruction a branch? • fall thru - next address if the last branch is not taken • target - next address if the last branch is taken
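The fields above map naturally onto a record plus a hit test. A minimal Python sketch (the class and function names, and the exact hit condition of matching both the tag and the predicted branch directions, are illustrative, not the paper's exact implementation):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceCacheLine:
    """One trace cache line, following the field list on slide 4."""
    valid: bool = False        # valid bit - is the trace valid?
    tag: int = 0               # starting address of the trace
    branch_flags: int = 0      # predictor bits for the embedded branches
    mask: bool = False         # does the trace end in a branch?
    fall_thru: int = 0         # next fetch address if the last branch is not taken
    target: int = 0            # next fetch address if the last branch is taken
    insts: List[int] = field(default_factory=list)

def lookup(line: TraceCacheLine, fetch_addr: int, predictions: int) -> bool:
    """Hit iff the line is valid, the tag matches the fetch address,
    and the stored branch flags agree with the current predictions."""
    return line.valid and line.tag == fetch_addr and line.branch_flags == predictions

line = TraceCacheLine(valid=True, tag=0x400, branch_flags=0b101,
                      insts=[0x400, 0x404, 0x408])
assert lookup(line, 0x400, 0b101)          # tag and predictions match: hit
assert not lookup(line, 0x400, 0b100)      # prediction mismatch: miss
```

On a hit, the whole multi-basic-block trace is fetched in one cycle, which is the high-bandwidth fetch the slides are after.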

  5. Fill Units [Melvin’88] && [Franklin’94] • Originally proposed to take a stream of scalar instructions and compact them into VLIW-type instructions. • These instructions go into a shadow cache. • Sound familiar?

  6. Differences • Conceptually similar, but their aims differ. • Trace caches => high-BW instruction fetching • Fill units => ease multiple-issue complexity

  7. The Fill Unit Today • Nowadays, papers refer to the fill unit as the mechanism that feeds trace caches

  8. Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors • Trace caches are more awesome than we thought, since they sit off the main fetch-issue pipeline. • This makes them latency tolerant. • So we can introduce extra “logic” to help place instructions into the trace cache

  9. Optimization I • Register Moves • ADD Rx <- Ry + 0 • Rename the output register to • the same physical register • the same operand tag
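A move disguised as an add (ADD Rx <- Ry + 0) can be eliminated at fill time by pointing Rx's rename-table entry at Ry's physical register, so no ALU operation ever issues. A minimal Python sketch of that idea (the rename-table representation and function name are illustrative):

```python
def eliminate_move(dst: str, src: str, imm: int, rename: dict) -> bool:
    """If the op is ADD dst <- src + 0, map dst to src's physical
    register / operand tag instead of issuing the add.
    Returns True when the instruction can be dropped from the trace."""
    if imm == 0:
        rename[dst] = rename[src]   # dst now aliases src's physical register
        return True
    return False

# rename table: architectural register -> physical register tag
rename = {"Ry": "p7", "Rx": "p3"}

assert eliminate_move("Rx", "Ry", 0, rename)     # move eliminated
assert rename["Rx"] == rename["Ry"] == "p7"      # both name the same physical reg
assert not eliminate_move("Rx", "Ry", 4, rename) # a real add: keep it
```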

  10. Optimization I

  11. Optimization II • Reassociation • ADD Rx <- Ry + 4 • ADD Rz <- Rx + 4 => ADD Rz <- Ry + 8 • (Does so across control flow boundaries)
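Reassociation folds chained immediate adds so the second instruction reads the original source instead of the intermediate result, breaking the serial dependence. A minimal Python sketch over a trace of (dst, src, immediate) adds (the representation is illustrative; the paper applies this across control-flow boundaries within a trace):

```python
def reassociate(ops):
    """Rewrite chains like ADD Rx <- Ry + 4; ADD Rz <- Rx + 4
    into ADD Rz <- Ry + 8. Both instructions are kept (Rx may be
    live), but the second no longer waits on the first."""
    out = []
    defs = {}  # dst -> (root source register, accumulated immediate)
    for dst, src, imm in ops:
        if src in defs:
            root, acc = defs[src]
            src, imm = root, acc + imm   # fold through the earlier add
        defs[dst] = (src, imm)
        out.append((dst, src, imm))
    return out

trace = [("Rx", "Ry", 4), ("Rz", "Rx", 4)]
assert reassociate(trace) == [("Rx", "Ry", 4), ("Rz", "Ry", 8)]
```

The payoff is parallelism: both adds can now issue in the same cycle.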

  12. Optimization II

  13. Optimization III • Scaled Adds • SHIFT Rw <- Rx << 1 • ADD Ry <- Rw + Rz => • SCALEADD Ry <- (Rx << 1) + Rz • (Limit to 3-bit shifts)
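The shift-plus-add pair collapses into a single fused operation when the shifted value feeds the add and the shift amount fits in 3 bits. A minimal Python sketch of that pattern match (the tuple encoding and function name are illustrative):

```python
def fuse_scaled_add(shift_op, add_op):
    """Combine SHIFT Rw <- Rx << k with ADD Ry <- Rw + Rz into
    SCALEADD Ry <- (Rx << k) + Rz, limited to k <= 3 as on the slide.
    Returns the fused op, or None when fusion doesn't apply."""
    w, x, k = shift_op          # (dst, src, shift amount)
    y, a, b = add_op            # (dst, src1, src2)
    if k <= 3 and w in (a, b):
        other = b if a == w else a
        return ("SCALEADD", y, x, k, other)
    return None

fused = fuse_scaled_add(("Rw", "Rx", 1), ("Ry", "Rw", "Rz"))
assert fused == ("SCALEADD", "Ry", "Rx", 1, "Rz")
assert fuse_scaled_add(("Rw", "Rx", 5), ("Ry", "Rw", "Rz")) is None  # shift too large
```

The 3-bit limit keeps the scaled-add ALU cheap, mirroring the scaled addressing modes hardware already supports.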

  14. Optimization III

  15. Optimization IV • Instruction Placement • Operand bypassing, etc., can be a burden • If we place the instructions in a better order to ease this, we gain some performance.
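One simple way to think about placement is ordering instructions by dependence depth, so producers sit just before their consumers and bypass distances shrink. The Python sketch below is an illustrative heuristic only, not the paper's actual slot-assignment algorithm:

```python
def place_by_depth(insts, deps):
    """Order instructions by dependence depth (longest producer chain).
    insts: list of instruction ids; deps: id -> list of producer ids."""
    depth = {}
    def d(i):
        if i not in depth:
            depth[i] = 1 + max((d(p) for p in deps.get(i, [])), default=-1)
        return depth[i]
    return sorted(insts, key=d)   # stable sort keeps original order at equal depth

# i2 depends on i0, i3 depends on i2; i1 is independent
order = place_by_depth(["i0", "i1", "i2", "i3"], {"i2": ["i0"], "i3": ["i2"]})
assert order.index("i0") < order.index("i2") < order.index("i3")
```

Because placement happens in the fill unit, off the critical fetch path, even a slow heuristic like this costs nothing at fetch time.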

  16. Optimization IV

  17. Combined Results

  18. Instruction Path Coprocessors • Programmable on-chip coprocessor • Has its own ISA • Operates on core instr to transform them into an efficient internal format

  19. What good are these? • Example: Intel P6 • Converts x86 into uops (CISC on RISC) • Since it operates on instructions and sits outside the main pipeline, it is perfect for... fill units

  20. OG I-COP [Chou’00] • All about dynamic code modification • No change to the ISA or to the HW necessary • However, the compiler-generated object code isn’t what actually runs

  21. OG I-COP • The original implementation was statically scheduled, exploiting parallelism via VLIW • Each I-COP can have more than one VLIW engine (called slices). This helped with ILP

  22. So what’s wrong? • Takes quite a bit of hardware: many replicated slices, each needing its own I-mem • Takes up a lot of area on the chip

  23. Enter PipeRench • Reconfigurable fabric for computation (originally aimed at stream/media applications) • This lets us map programs to hardware. • The key to PipeRench is that reconfiguration is fast.

  24. Reconfiguration • Reconfiguration is done using a “scrolling window”

  25. PipeRench • More Hardware = More Throughput

  26. Pipelined • Virtual stripes allow for efficient area usage

  27. Inside the Stripes • Using 0.18um, 1 stripe is 1.03 sq mm

  28. PipeRench Roadmap ‘97 • 28 stripes in .35 um tech • 32 PEs in each stripe • 512 stripes of configuration cache (18 configs) • Speed: 100MHz

  29. Performance Example • IDEA Encryption (Symmetric Encryption for PGP) • 232 virtual stripes, 64 bits wide • PipeRench: 940MB/sec • ASIC: 177 Mb/sec in 1993 • ASIC: 2GB/sec in 1997 • Pentium ~ 1Mb/sec • Using 232 rows => 7.8 GB/sec

  30. DIL • PipeRench configurations are written in the Dataflow Intermediate Language • The output is a set of configuration bits (one set per virtual stripe).

  31. PipeRench advantages • Write DIL once; the number of physical stripes doesn’t matter • Apply DIL code selectively at run time

  32. PipeRench I-COP • Use PipeRench to implement I-COPs. • Compare to original VLIW I-COP implementation. • See where the best trade-off point is.

  33. Dynamic Code Modifications • Trace Cache Fill Unit => 11 V-stripes • Register Move (done for trace run +5x) • 22 V-stripes (plus 11) • Stride Data Prefetching => 14 V-stripes • LDS Prefetching => 9 V-Stripes

  34. VLIW Equivalents • Trace construction => 3 PL, 15 physical • Register Move => IPC 2.69 to 2.72 • Stride Prefetch => Reduce to only 9 physical stripes

  35. Area Evaluation • If we run the I-COP at half the core speed: • 33 physical stripes is ~34 sq mm • 9 physical stripes is ~9.27 sq mm

  36. ?
