Gary Tyson

Region Caching: Motivation

• High-level languages influence memory reference behavior
– Caused by translating complex semantics into simple code (somewhat application independent)
• Programming conventions also predictably influence memory reference behavior
• Exploiting memory region characteristics can lead to more effective caches
• Attacking each region individually
– Finding optimal cache designs for each region

Memory Space Partitioning

• Based on programming language
• Non-overlapped subdivisions

[Figure: MIPS architecture memory map, bounded by reserved areas at max mem and min mem: Protected, Dynamic Data Region, Static Data Region, Code Region]

Memory Space Partitioning

• Based on programming language
• Non-overlapped subdivisions
• Split code and data → I-cache & D-cache

[Figure: MIPS architecture memory map, bounded by reserved areas at max mem and min mem: Protected, Dynamic Data Region, Static Data Region, Code Region]

Memory Space Partitioning

• Based on programming language
• Non-overlapped subdivisions
• Split code and data → I-cache & D-cache
• Split data into regions
– Stack (grows downward)
– Heap (grows upward)
– Global (static)
– Read-only (static)

[Figure: MIPS architecture memory map, bounded by reserved areas at max mem and min mem: Protected, Stack (grows downward), Heap (grows upward), Static Global Data Region, Read-only data, Code Region]
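
As a rough illustration of the partitioning above, the sketch below assigns an address to a region by comparing it against region boundaries. The boundary constants and the `classify` helper are hypothetical values chosen for illustration (loosely following a conventional 32-bit MIPS user-space layout); they are not taken from the slides.

```python
# Hypothetical region boundaries for a 32-bit address space (assumed values).
CODE_BASE = 0x0040_0000     # start of the code region
RDATA_BASE = 0x1000_0000    # start of read-only data
GLOBAL_BASE = 0x1001_0000   # start of global static data
HEAP_BASE = 0x1004_0000     # start of the heap (grows upward)
STACK_TOP = 0x7FFF_FFFF     # highest stack address (stack grows downward)


def classify(addr, heap_end, sp):
    """Return the region name for addr, given the current heap end and stack pointer."""
    if addr > STACK_TOP or addr < CODE_BASE:
        return "reserved"
    if addr >= sp:
        return "stack"
    if HEAP_BASE <= addr < heap_end:
        return "heap"
    if GLOBAL_BASE <= addr < HEAP_BASE:
        return "global"
    if RDATA_BASE <= addr < GLOBAL_BASE:
        return "read-only"
    if addr < RDATA_BASE:
        return "code"
    return "unmapped"        # gap between the heap end and the stack pointer


region = classify(0x7FFF_F000, heap_end=0x1010_0000, sp=0x7FFF_E000)  # "stack"
```

Because the regions do not overlap, a handful of comparisons on the address is enough to pick exactly one region per access.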

Stack Reference of Memory Instructions

[Figure: stacked bars, 0% to 100%, per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, Avg); legend: Stack ref, Static ref, Heap ref, Read-only]

Stack + Global

[Figure: stacked bars, 0% to 100%, per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, Avg); legend: Stack ref, Static ref, Heap ref, Read-only]

Stack + Global + Heap

[Figure: stacked bars, 0% to 100%, per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, Avg); legend: Stack ref, Static ref, Heap ref, Read-only]

Stack + Global + Heap + Read-only Data

[Figure: stacked bars, 0% to 100%, per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, Avg); legend: Stack ref, Static ref, Heap ref, Read-only]

Region-based Partitioning

• Run-time virtual memory address space is partitioned by programming languages.
• Reference patterns and characteristics of these data are different.
• Can we reduce power while retaining performance?

Miss Rate by Memory Region

[Figure: miss rate (0.00 to 0.30) versus individual cache size (256, 512, 1k, 2k, 4k, 8k, 16k, 32k, 64k) for heap data, global static data, and stack data]

• Stack data miss rate levels off quickly; so does global data
• Heap miss rate drops linearly each time the cache size is doubled
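
One way to produce a curve like the one on this slide is to replay each region's address trace through caches of increasing size. The sketch below is a minimal direct-mapped model; the `miss_rate` function, the 32-byte default line size, and the synthetic trace are assumptions for illustration, not the simulator or the data behind the slide.

```python
def miss_rate(trace, cache_bytes, line_bytes=32):
    """Miss rate of a direct-mapped cache over a list of byte addresses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines            # block currently held by each line
    misses = 0
    for addr in trace:
        block = addr // line_bytes       # memory block number
        index = block % num_lines        # direct-mapped placement
        if tags[index] != block:         # miss: replace whatever was there
            tags[index] = block
            misses += 1
    return misses / len(trace)


# Synthetic trace (not measured data): two passes over a 1 KB array of 4-byte words.
trace = [addr for _ in range(2) for addr in range(0, 1024, 4)]
rates = {size: miss_rate(trace, size) for size in (256, 512, 1024, 2048)}
# 256 and 512 bytes give 0.125; 1024 and 2048 bytes give 0.0625
```

In this toy trace the miss rate stops improving once the cache holds the whole 1 KB working set, which is the kind of levelling off the slide reports for stack and global data.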

Region-based Cachelets (RBC)

• A simple idea
• A horizontal partitioning
• Clock-gated caches
• Only enable (cycle) the region cachelet being accessed
• Redirect >70% of accesses to smaller region cachelets

[Figure: the address feeds an Address DeMultiplexer whose region select drives the chip selects (cs) of the region cachelets RC1 and RC2 (stack and static) and of the L1 cache; all are attached to the Processor Data Bus]
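
The region select can be pictured as a small routing function that enables exactly one structure per access. The sketch below is a toy model under stated assumptions: the names `select_cachelet` and `summarize` are hypothetical, and the per-access energy values are arbitrary relative units, not measured numbers.

```python
from collections import Counter

# Hypothetical relative energy per access (arbitrary units, not measured values).
ENERGY = {"stack_cachelet": 2, "global_cachelet": 2, "L1": 10}


def select_cachelet(region):
    """Region select: name the single structure enabled for this access."""
    if region == "stack":
        return "stack_cachelet"
    if region == "global":
        return "global_cachelet"
    return "L1"                      # heap and everything else use the regular L1


def summarize(regions):
    """Return (fraction redirected away from L1, energy relative to an L1-only design)."""
    counts = Counter(select_cachelet(r) for r in regions)
    total = sum(counts.values())
    redirected = 1 - counts["L1"] / total
    energy = sum(ENERGY[name] * n for name, n in counts.items())
    return redirected, energy / (ENERGY["L1"] * total)


result = summarize(["stack", "stack", "global", "heap"])   # (0.75, 0.4)
```

Only the selected structure is cycled on each access, so the more references land in the small cachelets, the lower the relative dynamic energy in this model.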

Power Reduction of RBC

[Figure: bar chart, y-axis 0.00 to 1.00, per benchmark (cjpeg, djpeg, mpeg2encoder, mpeg2decoder, rawcaudio, rawdaudio, g721encode, g721decode, pgpencode, pgpdecode, pegwitencode, pegwitdecode, gs, mesa.texgen, mesa.osdemo, mesa.mipmap, rasta, epic, unepic, Avg); series: S4k-G4k-32kL1 vs. 32k-DM, S4k-G4k-32kL1 vs. 32k-4way, S4k-G4k-32kL1 vs. 40k-5way]

• Dynamic cache power reduced by as much as 63%

Overview of Talk

• Some of our prior work in this area
– Region Based Cache Design
– Stack Value File
• Current Research
– Micro-architectural support for Java VM
– Application Specific Processor Design
– VM improvement (security; linkage; debug)

Improving Execution Time

• Region Caches exploit differences in working set size for Stack, Static and Heap regions to reduce power without hurting performance.
• Our Stack Value File research exploits specific stack reference characteristics to improve execution performance by developing a new data storage structure.

Morphing $sp-relative References

• Morph $sp-relative references into register accesses
• Use a Stack Value File (SVF)
• Resolve address early in decode stage for stack-pointer indexed accesses
• Resolve stack memory dependency early
• Aliased references are re-routed to SVF
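
The decode-stage address resolution can be pictured as turning a base register plus offset into an SVF entry index. The sketch below assumes 64-bit entries and a simple modulo mapping; `morph`, `ENTRY_BYTES`, and the 1024-entry file size are hypothetical names and values used only for illustration, not details taken from the slides.

```python
ENTRY_BYTES = 8   # assumed 64-bit SVF entries


def morph(base_reg, offset, sp, num_entries=1024):
    """Return an SVF entry index for an $sp-relative access, else None.

    None means the reference is not morphed and uses the regular memory pipeline.
    """
    if base_reg != "$sp":
        return None                       # only stack-pointer indexed accesses qualify
    if not 0 <= offset < num_entries * ENTRY_BYTES:
        return None                       # outside the window the file can hold
    return ((sp + offset) // ENTRY_BYTES) % num_entries


# Example: a store such as `stq $r10, 24($sp)` with sp = 0x7FFF0000
index = morph("$sp", 24, sp=0x7FFF0000)   # index is 3 under these assumptions
```

Because the index depends only on the stack pointer and an immediate offset, it can be computed without waiting for the execute stage, which is the point of resolving the address early.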

Microarchitecture Extension (PIII-like)

Stack Value File accesses occur at the same pipeline stage as register reads

[Figure: pipeline stages Fetch, Decode, Dispatch, Issue, Execute, Commit; blocks: Instr-Cache, Pre-Decode, Decoder, DecoderQ, Reg Renamer (RAT), Reservation Station, LSQ, Func Unit, Ld/St Unit, MOB, ArchRF, ReOrder Buffer; SVF additions: Morphing, Hash, Max SP, SP, Offset, interlock, Stack Value File]

Stack Reference Characteristics

• Contiguity
– Good temporal and spatial locality
– Can be stored in a simple, fast structure
   Small die area relative to a regular cache
   Less power dissipation
– No address tag needed for each datum
   Keep the current TOS address

Cache Distribution by Region

[Figure: number of hits to each set (log scale, 1.E+00 to 1.E+07) versus set (1 to 512), by region: stack, global static, heap]

Stack Reference Characteristics

• First touch is almost always a Store
– Avoid wasting bandwidth to bring in dead data
– "Write Validate" allocation policy
• Deallocated stack frame
– Dead data (it must be written before it is read again)
– No need to write it back to memory
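
The two observations above suggest an allocation policy that can be modeled in a few lines. The sketch below is a toy, word-granular model; the `StackStore` class and its method names are hypothetical and only illustrate write-validate allocation and the discarding of dead frames.

```python
class StackStore:
    """Toy word-granular store with write-validate allocation."""

    def __init__(self):
        self.words = {}      # word address -> value; presence means the word is valid
        self.fetches = 0     # words brought in from the next memory level

    def store(self, addr, value):
        # Write validate: the store itself makes the word valid; nothing is fetched.
        self.words[addr] = value

    def load(self, addr, backing):
        if addr not in self.words:          # only a load miss fetches data
            self.words[addr] = backing.get(addr, 0)
            self.fetches += 1
        return self.words[addr]

    def pop_frame(self, old_sp, new_sp):
        # The stack grows downward, so words in [old_sp, new_sp) are dead after
        # deallocation and are dropped without being written back.
        for addr in [a for a in self.words if old_sp <= a < new_sp]:
            del self.words[addr]


s = StackStore()
s.store(96, 1)               # first touch is a store: no fetch
s.store(104, 2)
value = s.load(96, {})       # hits the word validated by the store
s.pop_frame(96, 112)         # frame deallocated: both words discarded
```

In this model a frame that is written, read, and then popped generates no traffic in either direction.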

Memory Traffic

• SVF dramatically reduces memory traffic, by orders of magnitude.
– For gcc, ~28M (stack cache ↔ L2) is reduced to ~86K (SVF ↔ L1), roughly a 300-fold reduction.
• Incoming traffic is eliminated because SVF does not allocate a cache line on a miss.
• Outgoing traffic consists of only those words that are dirty when evicted (instead of entire cache lines).
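
The difference between writing back dirty words and writing back whole lines can be made concrete with a small count. The sketch below is illustrative only: `outgoing_bytes` is a hypothetical helper, and the 8-byte word and 32-byte line sizes are assumptions.

```python
WORD_BYTES = 8    # assumed word size
LINE_BYTES = 32   # assumed cache line size


def outgoing_bytes(dirty_addrs):
    """Return (per_word, per_line) bytes written back for a set of dirty byte addresses."""
    words = {addr // WORD_BYTES for addr in dirty_addrs}   # distinct dirty words
    lines = {addr // LINE_BYTES for addr in dirty_addrs}   # distinct lines they touch
    return len(words) * WORD_BYTES, len(lines) * LINE_BYTES


sparse = outgoing_bytes([0, 64, 128])     # (24, 96): one dirty word in each of three lines
dense = outgoing_bytes([0, 8, 16, 24])    # (32, 32): a fully dirty line costs the same
```

The saving comes from sparsely dirty lines; when every word of a line is dirty, both policies write the same number of bytes.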

Speedup Potential of Stack Value File

[Figure: speedup (1.0 to 1.9) per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, Avg); series: 4-wide, 8-wide, 16-wide, 16-wide (gshare)]

• Assume all references can be morphed
• ~30% speedup for a 16-wide machine with a dual-ported L1

Why is SVF Faster?

• It reduces the load-to-use latency of stack references
• It effectively increases the number of memory ports by rerouting more than ½ of all memory references to the SVF
• It reduces contention in the MOB
• It allows more flexibility in renaming stack references

MemoryLogix MLX1 Processor

Taken from: Peter Song, "MLX1: A Tiny Multithreaded 586 Core for Smart Mobile Devices".

Conclusions

• Microarchitects need to develop new tradeoffs to achieve required design goals.
• By exploiting characteristics of the programming environment, it is possible to design more efficient microarchitectures.
• For many embedded applications, further improvement can be made by designing a custom processor for each application.
• Ultimately, success will enable more cycles to be used to solve tough software engineering problems.

Backup Foils

Baseline Microarchitecture

[Figure: pipeline stages Fetch, Decode, Dispatch, Issue, Execute, Commit; blocks: Instr-Cache, Decoder, DecoderQ, Reg Renamer (RAT), Reservation Station, LSQ, Func Unit, Ld/St Unit, MOB, ArchRF, ReOrder Buffer]

Microarchitecture Extension

[Figure: baseline pipeline extended with Pre-Decode, Morphing, Hash, Max SP, SP, Offset, interlock, and the Stack Value File]

Microarchitecture Extension

stq $r10, 24($sp)

[Figure: baseline pipeline extended with Pre-Decode, Morphing, Hash, Max SP, SP, Offset, interlock, and the Stack Value File; TOS is marked]

Microarchitecture Extension

stq $r10, 24($sp)

[Figure: baseline pipeline extended with Pre-Decode, Morphing, Hash, Max SP, SP, Offset, interlock, and the Stack Value File; TOS is marked and the value 3 is shown]

Microarchitecture Extension

stq $r10, 24($sp)
$p35 → ROB-18

[Figure: baseline pipeline extended with Pre-Decode, Morphing, Hash, Max SP, SP, Offset, interlock, and the Stack Value File; TOS is marked, and the Reg Renamer (RAT) and Morphing blocks are highlighted]

Microarchitecture Extension

stq $r10, 24($sp)
$p35 → ROB-18

[Figure: baseline pipeline extended with Pre-Decode, Morphing, Hash, Max SP, SP, Offset, interlock, and the Stack Value File; TOS is marked, and the Reg Renamer (RAT) and Morphing blocks are highlighted]

Microarchitecture Extension

stq $r10, 24($sp)
$p35 → SVF3

[Figure: baseline pipeline extended with Pre-Decode, Morphing, Hash, Max SP, SP, Offset, interlock, and the Stack Value File; TOS is marked, and the Reg Renamer (RAT) and Morphing blocks are highlighted]

Stack + Global + Heap + Rdata

[Figure: stacked bars, 0% to 100%, per benchmark (cjpeg, djpeg, mpeg2encoder, mpeg2decoder, rawcaudio, rawdaudio, g721encode, g721decode, pgpencode, pgpdecode, pegwitencode, pegwitdecode, gs, mesa.texgen, mesa.osdemo, mesa.mipmap, rasta, epic, unepic, Avg); legend: Stack, Global, Heap, Rdata]

Simulation Framework

• Wattch simulator [Brooks et al. 00]
– Consider switching power only
– Simple clock gating
• Baseline microarchitecture parameters
– Close to Intel StrongARM SA-110
– Single-issue 5-stage in-order pipeline
– Unified 32B-line 32KB L1
– Region-based caching only applied to L1

Stack Depth Variation

[Figure: 197.parser stack depth in 64-bit SVF entries (y-axis 500 to 2000) along the execution timeline]

Offset Locality of Stack

• Cumulative offset within a function call
• Avg: 3b - 380b
• >80% of offsets within "400b"
• >99% of offsets within "8Kb"

[Figure: cumulative % (10 to 100) versus offset in bytes (log scale, 10 to 10000)]

SVF Reference Type Breakdown

[Figure: stacked bars, 0% to 100%, per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, Avg); legend: fast_svf_ld, fast_svf_st, rerouted_svf_ld, rerouted_svf_st]

• 86% of stack references can be morphed
• Re-routed references enter the regular memory pipeline

Vital references (by region)

[Figure: % of total vital loads (20 to 100) by region (Literals, Heap, Static, Stack) for BZIP2, CRAFTY, EON, GAP, GZIP, MCF, PARSER, TWOLF]