
Gary Tyson

Region Caching: Motivation

• High-level languages influence memory reference behavior
– Caused by translating complex semantics into simple code (somewhat application independent)
• Programming conventions also predictably influence memory reference behavior
• Exploiting memory region characteristics can lead to more effective caches
• Attacking each region individually
– Finding optimal cache designs for each region

Memory Space Partitioning

• Based on programming language
• Non-overlapped subdivisions
• Split code and data → I-cache & D-cache
• Split data into regions
– Stack (grows downward)
– Heap (grows upward)
– Global (static)
– Read-only (static)

[Figure: MIPS architecture memory map between min mem and max mem, with labels: reserved, protected, stack (grows downward), heap (grows upward), static global data region, read-only data, code region, reserved.]
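The partitioning above can be sketched as a lookup from address to region. This is a minimal illustration, not the talk's mechanism: the boundary addresses in `REGION_STARTS` are hypothetical values loosely modeled on a 32-bit MIPS-style layout, and the fixed split between heap and stack is a simplification, since both regions grow at run time.

```python
from bisect import bisect_right

# Hypothetical lowest address of each region (assumption, not from the slides).
# Real values depend on the ABI, the linker script, and the OS.
REGION_STARTS = [
    (0x0000_0000, "reserved"),
    (0x0040_0000, "code"),
    (0x1000_0000, "read-only"),
    (0x1000_8000, "global"),
    (0x1004_0000, "heap"),
    (0x7000_0000, "stack"),
    (0x8000_0000, "protected"),
]
_STARTS = [start for start, _ in REGION_STARTS]


def classify(addr: int) -> str:
    """Return the name of the non-overlapping region that contains addr."""
    if not 0 <= addr <= 0xFFFF_FFFF:
        raise ValueError("address outside the 32-bit address space")
    # bisect_right finds the first region start greater than addr;
    # the region containing addr is the one just before it.
    return REGION_STARTS[bisect_right(_STARTS, addr) - 1][1]


if __name__ == "__main__":
    for a in (0x0040_0100, 0x1000_9000, 0x1004_2000, 0x7FFF_FFF0):
        print(hex(a), classify(a))
```

Because the regions do not overlap, a sorted list of start addresses and one binary search are enough to classify any reference.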


Stack Reference of Memory Instructions

[Figure, built up over four slides (Stack; Stack + Global; Stack + Global + Heap; Stack + Global + Heap + Read-only Data): stacked bars from 0% to 100% showing the share of memory instructions that reference each region (Stack ref, Static ref, Heap ref, Read-only) for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, and Avg.]

Region-based Partitioning

• Run-time virtual memory address space is partitioned by programming languages.
• Can we reduce power while retaining performance?
• Reference patterns and characteristics of these data are different.

Miss Rate by Memory Region

[Figure: miss rate (0.00 to 0.30) versus individual cache size (256, 512, 1k, 2k, 4k, 8k, 16k, 32k, 64k) for heap data, global static data, and stack data.]

• Stack data miss rate levels off quickly; so does global data
• Heap miss rate drops linearly each time the cache size is doubled
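A per-region miss-rate curve like the one above can be measured by replaying each region's references through a cache model at several sizes. The sketch below assumes a direct-mapped cache with 32-byte lines and uses two synthetic traces (`stack_like_trace` and `heap_like_trace` are hypothetical stand-ins, not the talk's benchmarks), so it shows the method rather than reproducing the plotted data.

```python
def miss_rate(addresses, cache_bytes, line_bytes=32):
    """Miss rate of a direct-mapped cache over a sequence of byte addresses."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines  # block number held by each cache line
    misses = 0
    total = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if resident[index] != block:
            resident[index] = block
            misses += 1
        total += 1
    return misses / total if total else 0.0


def stack_like_trace(passes=50, frame_bytes=256, base=0x7FFF_0000):
    """Synthetic trace: repeated 8-byte accesses to one small frame."""
    return [base + off for _ in range(passes) for off in range(0, frame_bytes, 8)]


def heap_like_trace(passes=50, footprint=16 * 1024, base=0x1004_0000, stride=32):
    """Synthetic trace: repeated sweeps over a larger footprint."""
    return [base + off for _ in range(passes) for off in range(0, footprint, stride)]


if __name__ == "__main__":
    for size in (256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536):
        print(size,
              round(miss_rate(stack_like_trace(), size), 4),
              round(miss_rate(heap_like_trace(), size), 4))
```

With these synthetic traces the small stack working set already fits in the smallest cache, while the heap sweep only stops missing once the cache covers its whole footprint; real heap traces have mixed working sets and fall off more gradually.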


Region-based Cachelets (RBC)

• A simple idea
• A horizontal partitioning
• Clock gated caches
• Only enable (cycle) the region cachelet being accessed
• Redirect >70% accesses to smaller region cachelets

[Figure: the address feeds an Address DeMultiplexer with a region select; it drives cs lines to the region cachelets (RC1, RC2; stack and static) and the L1 Cache, all connected to the Processor Data Bus.]
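The routing idea can be sketched as a demultiplexer that enables exactly one structure per access and counts activations. Everything named here is hypothetical: `region_of` uses made-up address boundaries, and the activation count is only a stand-in for a real power model, which the slides obtain from simulation.

```python
from collections import Counter


def region_of(addr):
    """Hypothetical region decode; boundaries are illustrative assumptions."""
    if addr >= 0x7000_0000:
        return "stack"
    if 0x1000_0000 <= addr < 0x1004_0000:
        return "global"
    return "other"


class RegionRouter:
    """Sends each access to one cachelet or to L1; only that one is enabled."""

    def __init__(self, decode=region_of, cachelet_regions=("stack", "global")):
        self.decode = decode
        self.cachelet_regions = set(cachelet_regions)
        self.activations = Counter()

    def access(self, addr):
        region = self.decode(addr)
        target = region if region in self.cachelet_regions else "L1"
        self.activations[target] += 1  # every other structure stays gated
        return target

    def redirected_fraction(self):
        """Fraction of accesses served by the small region cachelets."""
        total = sum(self.activations.values())
        if total == 0:
            return 0.0
        return (total - self.activations["L1"]) / total


if __name__ == "__main__":
    router = RegionRouter()
    for addr in [0x7FFF_FF00] * 7 + [0x1000_0010] + [0x2000_0000] * 2:
        router.access(addr)
    print(dict(router.activations), router.redirected_fraction())
```

The design point this illustrates is that a reference decoded as stack or global never cycles the larger L1 array, so the fraction returned by `redirected_fraction` bounds how often the large structure can stay idle.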

Power Reduction of RBC

[Figure: bars on a 0.00 to 1.00 scale comparing S4k-G4k-32kL1 vs. 32k-DM, S4k-G4k-32kL1 vs. 32k-4way, and S4k-G4k-32kL1 vs. 40k-5way for cjpeg, djpeg, mpeg2encoder, mpeg2decoder, rawcaudio, rawdaudio, g721encode, g721decode, pgpencode, pgpdecode, pegwitencode, pegwitdecode, gs, mesa.texgen, mesa.osdemo, mesa.mipmap, rasta, epic, unepic, and Avg.]

• Dynamic cache power reduced by as much as 63%


Overview of Talk

• Some of our prior work in this area
– Region Based Cache Design
– Stack Value File
• Current Research
– Micro-architectural support for Java VM
– Application Specific Processor Design
– VM improvement (security; linkage; debug)

Improving Execution Time

• Region Caches exploit differences in working set size for Stack, Static and Heap regions to reduce power without hurting performance.
• Our Stack Value File research exploits specific stack reference characteristics to improve execution performance by developing a new data storage structure.


Morphing $sp-relative References

• Morph $sp-relative references into register accesses
• Use a Stack Value File (SVF)
• Resolve address early in decode stage for stack-pointer indexed accesses
• Resolve stack memory dependency early
• Aliased references are re-routed to SVF
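The decode-stage decision can be sketched as follows. This is a simplified model under stated assumptions: entries are 64 bits wide, the SVF entry is taken directly as the offset from the top of stack divided by the entry size, and `SVF_ENTRIES` is a hypothetical capacity. The hashing and Max SP tracking that appear in the pipeline diagram are not modeled.

```python
from dataclasses import dataclass

WORD_BYTES = 8      # assumes 64-bit SVF entries
SVF_ENTRIES = 1024  # hypothetical capacity, not taken from the slides


@dataclass(frozen=True)
class MemOp:
    """A load or store written as offset(base_reg), e.g. 24($sp)."""
    base_reg: str
    offset: int


def morph(op, entries=SVF_ENTRIES):
    """Return ("svf", entry) if op can be morphed at decode, else ("memory", None)."""
    if op.base_reg != "$sp":
        # Address is unknown until execute; use the regular memory pipeline.
        return ("memory", None)
    if op.offset < 0 or op.offset % WORD_BYTES != 0:
        return ("memory", None)
    entry = op.offset // WORD_BYTES
    if entry >= entries:
        return ("memory", None)
    return ("svf", entry)


if __name__ == "__main__":
    print(morph(MemOp("$sp", 24)))  # a store such as stq $r10, 24($sp)
    print(morph(MemOp("$r4", 24)))
```

Because the base register and the immediate offset are both visible in the instruction word, the entry number is known without an address-generation step, which is what allows the access to be treated like a register read.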

Microarchitecture Extension (PIII-like)

Stack Value File accesses occur at the same pipeline stage as register reads

[Figure: pipeline stages Fetch, Decode, Dispatch, Issue, Execute, Commit, with blocks Instr-Cache, Decoder, DecoderQ, Reg Renamer (RAT), Reservation Station / LSQ, Func Unit, Ld/St Unit, MOB, ArchRF, and ReOrder Buffer; the extension adds Pre-Decode, SP, Max SP, offset, Hash, Morphing, interlock, and the Stack Value File.]


Stack Reference Characteristics

• Contiguity
– Good temporal and spatial locality
– Can be stored in a simple, fast structure
  • Small die area relative to a regular cache
  • Less power dissipation
– No address tag needed for each datum
  • Keep the current TOS address

Cache Distribution by Region

[Figure: number of hits to each set (log scale, 1.E+00 to 1.E+07) across cache sets 1 to 512, for stack, global static, and heap data.]


Stack Reference Characteristics

• First touch is almost always a Store
– Avoid wasting bandwidth to bring in dead data
– “Write Validate” allocation policy
• Deallocated stack frame
– Dead data (the next reference to it must be a write)
– No need to write them back to memory

Memory Traffic

• SVF dramatically reduces memory traffic, by orders of magnitude.
– For gcc, ~28M (Stack cache ↔ L2) reduced to ~86K (SVF ↔ L1), roughly a 300× reduction.
• Incoming traffic is eliminated because SVF does not allocate a cache line on a miss.
• Outgoing traffic consists of only those words that are dirty when evicted (instead of entire cache lines).
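The traffic behavior described in the last two slides (write-validate allocation, dead data in deallocated frames, word-granularity write-back) can be sketched with a toy model. This is an illustration under assumptions, not the SVF design itself: the class and method names are hypothetical, storage is a dictionary keyed by word address, and a load to a word that is not present is simply forwarded to the backing memory without allocating anything.

```python
class StackValueFile:
    """Toy word-granularity store with write-validate allocation."""

    def __init__(self):
        self.words = {}              # word address -> (value, dirty)
        self.forwarded_loads = 0     # loads served by backing memory
        self.written_back_words = 0  # outgoing traffic, counted in words

    def store(self, addr, value):
        # Write-validate: the word becomes valid on a store, nothing is fetched.
        self.words[addr] = (value, True)

    def load(self, addr, memory):
        if addr in self.words:
            return self.words[addr][0]
        # Assumption: a miss is forwarded and does not allocate an entry.
        self.forwarded_loads += 1
        return memory.get(addr, 0)

    def deallocate(self, lo, hi):
        """Discard words in [lo, hi): a popped frame is dead, so no write-back."""
        for addr in [a for a in self.words if lo <= a < hi]:
            del self.words[addr]

    def evict(self, addr, memory):
        """Evict one word, writing it back only if it is dirty."""
        value, dirty = self.words.pop(addr)
        if dirty:
            memory[addr] = value
            self.written_back_words += 1


if __name__ == "__main__":
    memory = {}
    svf = StackValueFile()
    svf.store(0x100, 1)
    svf.store(0x108, 2)
    svf.deallocate(0x108, 0x110)  # frame popped: word at 0x108 is dropped
    svf.evict(0x100, memory)      # live dirty word is written back
    print(memory, svf.written_back_words, svf.forwarded_loads)
```

In this model the only outgoing traffic is the single live dirty word; the word in the popped frame never reaches memory, and no store ever triggers a fetch.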


Speedup Potential of Stack Value File

[Figure: speedup (1.0 to 1.9) for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, and Avg, with series 4-wide, 8-wide, 16-wide, and 16-wide (gshare).]

• Assume all references can be morphed
• ~30% speedup for a 16-wide with a dual-ported L1

Why is SVF Faster?

• It reduces the load-to-use latency of stack references
• It effectively increases the number of memory ports by rerouting more than ½ of all memory references to the SVF
• It reduces contention in the MOB
• More flexibility in renaming stack references


MemoryLogix MLX1 Processor

Taken from: Peter Song, “MLX1: A Tiny Multithreaded 586 Core for Smart Mobile Devices”.

Conclusions

• Microarchitects need to develop new tradeoffs to achieve required design goals.
• By exploiting characteristics of the programming environment, it is possible to design more efficient microarchitectures.
• For many embedded applications, further improvement can be made by designing a custom processor for each application.
• Ultimately, success will enable more cycles to be used to solve tough software engineering problems.


Backup Foils

Baseline Microarchitecture

[Figure: pipeline stages Fetch, Decode, Dispatch, Issue, Execute, Commit, with blocks Instr-Cache, Decoder, DecoderQ, Reg Renamer (RAT), Reservation Station / LSQ, Func Unit, Ld/St Unit, MOB, ArchRF, and ReOrder Buffer.]


Microarchitecture Extension

[Figure, animated over six slides: the baseline pipeline extended with Pre-Decode, SP, Max SP, offset, Hash, Morphing, interlock, and the Stack Value File. The example instruction stq $r10, 24($sp) is followed through the pipeline; successive annotations are TOS, 3, $p35 with ROB-18, and finally $p35 with SVF3.]


Stack + Global + Heap + Rdata

[Figure: stacked bars from 0% to 100% showing the share of references to each region (Stack, Global, Heap, Rdata) for cjpeg, djpeg, mpeg2encoder, mpeg2decoder, rawcaudio, rawdaudio, g721encode, g721decode, pgpencode, pgpdecode, pegwitencode, pegwitdecode, gs, mesa.texgen, mesa.osdemo, mesa.mipmap, rasta, epic, unepic, and Avg.]

Simulation Framework

• Wattch simulator [Brooks et al. 00]
– Consider switching power only
– Simple clock gating
• Baseline microarchitecture parameters
– Close to Intel StrongARM SA-110
– Single-issue 5-stage in-order pipeline
– Unified 32B-line 32KB L1
– Region-based caching only applied to L1


Stack Depth Variation

[Figure: 197.parser stack depth in 64-bit SVF entries (y-axis, 500 to 2000) over the execution timeline (x-axis).]

Offset Locality of Stack

• Cumulative offset within a function call
• Avg: 3 B to 380 B
• >80% of offsets within 400 B
• >99% of offsets within 8 KB

[Figure: cumulative % (10 to 100) versus offset in bytes (log scale, 10 to 10000).]


SVF Reference Type Breakdown

[Figure: stacked bars from 0% to 100% breaking SVF references into fast_svf_ld, fast_svf_st, rerouted_svf_ld, and rerouted_svf_st for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, and Avg.]

• 86% of stack references can be morphed
• Re-routed references enter the regular memory pipeline

Vital references (by region)

[Figure: % of total vital loads (20 to 100) by region (Stack, Static, Heap, Literals) for benchmarks including BZIP2, CRAFTY, EON, GAP, GZIP, MCF, PARSER, and TWOLF.]