Trace Caches and optimizations therein - CSE 240C - Rushi Chakrabarti - PowerPoint PPT Presentation



SLIDE 1

Trace Caches

and optimizations therein

CSE 240C - Rushi Chakrabarti - Winter 2009

SLIDE 2

Trace Caches

[Rotenberg’96]

SLIDE 3

Trace Caches

For those not in the know:

  • I$ that captures dynamic instruction sequences
  • a trace is:
  • n instructions (cache line size) or
  • m basic blocks (branch predictor throughput)
  • + starting address
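A hedged sketch of how a fill unit might cap a trace by these two limits; the `(pc, is_branch)` stream format, the constants, and all names are illustrative, not from [Rotenberg'96]:

```python
# Illustrative trace termination: a trace is capped at N_INSTRS
# instructions (cache line size) or M_BLOCKS basic blocks (branch
# predictor throughput), and is indexed by its starting address.

N_INSTRS = 16   # cache line size, in instructions (assumed value)
M_BLOCKS = 3    # branches the predictor can supply per cycle (assumed)

def build_trace(retired):
    """Pack retired (pc, is_branch) pairs into one trace."""
    trace, blocks, start_pc = [], 0, None
    for pc, is_branch in retired:
        if start_pc is None:
            start_pc = pc       # trace is looked up by its start address
        trace.append(pc)
        if is_branch:
            blocks += 1         # a branch ends a basic block
        if len(trace) == N_INSTRS or blocks == M_BLOCKS:
            break               # hit either limit: trace is complete
    return start_pc, trace
```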
SLIDE 4

Trace Caches

  • valid bit - is trace valid?
  • tag - starting address
  • branch flags - predictor bits
  • mask - is last inst a branch?
  • fall thru - next address if last branch is not taken
  • target - next address if last branch is taken
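A minimal sketch of these fields as a record, plus a helper showing how fall-thru and target select the next fetch address; field names are illustrative, not the paper's:

```python
# Sketch of per-line trace cache metadata (illustrative names).
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceCacheLine:
    valid: bool = False        # is this trace valid?
    tag: int = 0               # starting address of the trace
    branch_flags: List[bool] = field(default_factory=list)  # predicted directions
    branch_mask: bool = False  # does the trace end in a branch?
    fall_thru: int = 0         # next fetch address if last branch not taken
    target: int = 0            # next fetch address if last branch taken

def next_fetch_addr(line: TraceCacheLine, last_taken: bool) -> int:
    """After consuming the trace, fetch continues here."""
    return line.target if last_taken else line.fall_thru
```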

SLIDE 5

Fill Units

  • Originally proposed to take a stream of scalar instructions and compact them into VLIW-type instructions.

  • These instructions go in a shadow cache.
  • Sound familiar?

[Melvin’88] && [Franklin’94]

SLIDE 6

Differences

  • Conceptually similar, but their aims differ.
  • Trace caches => high BW instr fetching
  • Fill Units => ease multiple issue complexity
SLIDE 7

The Fill Unit Today

  • Nowadays, papers refer to the fill unit as the mechanism that feeds trace caches

SLIDE 8

Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors

  • Trace caches are more awesome than we thought since they sit off the main fetch-issue pipeline.
  • This makes them latency tolerant.
  • So, we can introduce extra “logic” to help place instructions into the trace cache

SLIDE 9

SLIDE 10

Optimization I

  • Register Moves
  • ADD Rx <- Ry + 0
  • Rename output register to
  • same physical register
  • same operand tag
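Move elimination can be sketched in a rename map: a move like `ADD Rx <- Ry + 0` needs no ALU, so the fill unit can just point `Rx` at `Ry`'s physical register. A hedged sketch with illustrative names and tuple formats, not the paper's mechanism:

```python
# Sketch of register-move elimination at rename time (illustrative).

def rename(rename_map, op, dst, src, imm, free_regs):
    """Return the renamed op to issue, or None if the move is eliminated."""
    if op == "ADD" and imm == 0:
        # dst becomes an alias: same physical register, same operand tag
        rename_map[dst] = rename_map[src]
        return None                          # no instruction issued
    # normal rename path: allocate a fresh physical register
    rename_map[dst] = free_regs.pop()
    return (op, rename_map[dst], rename_map[src], imm)
```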
SLIDE 11

Optimization I

SLIDE 12

Optimization II

  • Reassociation
  • ADD Rx <- Ry + 4
  • ADD Rz <- Rx + 4 => ADD Rz <- Ry + 8
  • (Does so across control flow boundaries)
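A toy sketch of this constant folding over a trace: when an add-immediate consumes the result of another add-immediate, the two constants collapse and the dependence chain shortens. Names and tuple format are illustrative, and this simple version assumes sources are not redefined mid-trace:

```python
# Sketch of reassociation: ADD Rz <- Rx + c2, where Rx came from
# ADD Rx <- Ry + c1, folds to ADD Rz <- Ry + (c1 + c2).
# Assumes no source register is redefined within the trace.

def reassociate(trace):
    """trace: list of ('ADD', dst, src, imm) tuples, in order."""
    producers = {}   # dst -> (src, imm) for add-immediate ops seen so far
    out = []
    for op, dst, src, imm in trace:
        if src in producers:
            base, c1 = producers[src]
            src, imm = base, c1 + imm    # fold the two constants
        producers[dst] = (src, imm)
        out.append((op, dst, src, imm))
    return out
```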
SLIDE 13

Optimization II

SLIDE 14

Optimization III

  • Scaled Adds
  • SHIFT Rw <- Rx << 1
  • ADD Ry <- Rw + Rz =>
  • SCALEADD Ry <- (Rx << 1) + Rz
  • (Limit to 3-bit shifts)
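A hedged sketch of the shift+add fusion, with the 3-bit shift-amount limit from the slide; tuple format and names are illustrative. The SHIFT is kept in the output in case its result has other readers (dead-code removal would be a separate pass):

```python
# Sketch of scaled-add fusion: SHIFT Rw <- Rx << s followed by
# ADD Ry <- Rw + Rz becomes SCALEADD Ry <- (Rx << s) + Rz.

MAX_SHIFT = 7   # 3-bit shift amount limit

def fuse_scaled_adds(trace):
    out = []
    shifts = {}  # dst -> (src, amount) for small-shift producers
    for inst in trace:
        if inst[0] == "SHIFT":
            _, dst, src, amt = inst
            if amt <= MAX_SHIFT:
                shifts[dst] = (src, amt)
            out.append(inst)       # keep SHIFT: Rw may have other readers
        elif inst[0] == "ADD" and inst[2] in shifts:
            _, dst, a, b = inst
            src, amt = shifts[a]
            out.append(("SCALEADD", dst, src, amt, b))
        else:
            out.append(inst)
    return out
```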
SLIDE 15

Optimization III

SLIDE 16

Optimization IV

  • Instruction Placement
  • Operand bypassing, etc., can be a burden
  • If we can place the instructions in a better order to ease this, we can see some performance gain.

SLIDE 17

Optimization IV

SLIDE 18

Combined Results

SLIDE 19

Instruction Path Coprocessors

  • Programmable on-chip coprocessor
  • Has its own ISA
  • Operates on core instrs to transform them into an efficient internal format

SLIDE 20

What good are these?

  • Example: Intel P6
  • Converts x86 into uops (CISC on RISC)
  • Since it operates on instructions and sits outside the main pipeline, it is perfect for... fill units

SLIDE 21

SLIDE 22

OG I-COP

  • All about dynamic code modification
  • No change to ISA or to HW necessary
  • However, compiler-generated object code isn’t what is being run [Chou’00]

slide-23
SLIDE 23

OG I-COP

  • The original implementation was statically scheduled and exploited parallelism using VLIW
  • Each I-COP can have more than one VLIW engine (called slices). This helped with ILP.

slide-24
SLIDE 24

So what’s wrong?

  • Takes quite a bit of hardware: many slices replicated, each needing its own I-mem

  • Takes up a lot of area on the chip
slide-25
SLIDE 25

Enter PipeRench

  • Reconfigurable fabric for computation (originally aimed at stream/media applications)
  • This can allow us to map programs to hardware.
  • The key to PipeRench is that reconfiguration is supposedly fast.

slide-26
SLIDE 26

Reconfiguration

  • Reconfiguration is done using a “scrolling window”
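The scrolling window can be modeled as a toy schedule: virtual stripe v is loaded into physical slot v mod P at cycle v, and stays there until stripe v + P overwrites it, so one slot is reconfigured each cycle while the others compute. A purely illustrative model, not the actual hardware protocol:

```python
# Toy model of PipeRench's "scrolling window" reconfiguration:
# an unbounded supply of virtual stripes cycles through P physical slots.

def stripe_in_slot(cycle, slot, physical):
    """Which virtual stripe occupies a physical slot at a given cycle?

    Stripe v is configured into slot v % physical at cycle v and remains
    until stripe v + physical overwrites it."""
    if cycle < slot:
        return None                            # slot not yet configured
    return cycle - (cycle - slot) % physical   # most recent stripe mapped here
```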
slide-27
SLIDE 27

PipeRench

  • More Hardware = More Throughput
slide-28
SLIDE 28

Pipelined

  • Virtual stripes allow for efficient area usage
slide-29
SLIDE 29

Inside the Stripes

  • Using 0.18um, 1 stripe is 1.03 sq mm
slide-30
SLIDE 30

PipeRench Roadmap ‘97

  • 28 stripes in .35 um tech
  • 32 PEs in each stripe
  • 512 stripes of configuration cache (18 configs)

  • Speed: 100MHz
slide-31
SLIDE 31

Performance Example

  • IDEA Encryption (symmetric encryption for PGP)
  • 232 virtual stripes, 64 bits wide
  • PipeRench: 940 MB/sec
  • ASIC: 177 Mb/sec in 1993
  • ASIC: 2 GB/sec in 1997
  • Pentium ~ 1 Mb/sec
  • Using 232 rows => 7.8 GB/sec
slide-32
SLIDE 32

DIL

  • PipeRench configurations are written in the Dataflow Intermediate Language (DIL)
  • Output is a set of configuration bits (one set per virtual stripe).

slide-33
SLIDE 33

PipeRench advantages

  • Write DIL once; the # of physical stripes doesn’t matter
  • Apply DIL code selectively at run-time
slide-34
SLIDE 34

PipeRench I-COP

  • Use PipeRench to implement I-COPs.
  • Compare to the original VLIW I-COP implementation.
  • See where the best trade-off point is.
slide-35
SLIDE 35

Dynamic Code Modifications

  • Trace Cache Fill Unit => 11 V-stripes
  • Register Move (done for trace run +5x) => 22 V-stripes (plus 11)
  • Stride Data Prefetching => 14 V-stripes
  • LDS Prefetching => 9 V-stripes

slide-36
SLIDE 36

VLIW Equivalents

  • Trace construction => 3 PL, 15 physical stripes
  • Register Move => IPC 2.69 to 2.72
  • Stride Prefetch => reduced to only 9 physical stripes

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

Area Evaluation

  • If we maintain the I-COP at 0.5x core speed:
  • 33 physical stripes ~ 34 sq mm
  • 9 physical stripes ~ 9.27 sq mm
SLIDE 41

?