
4/26/2012

Leveraging High Performance Data Cache Techniques to Save Power in Embedded Systems

Major Bhadauria, Sally A. McKee, Karan Singh, Gary S. Tyson

Process Technology Leakage Problem

 Lower Operating Voltage
 Lower Transistor Threshold
 Exponential Increase in Leakage

[Figure: Leakage vs. Temperature — Ioff (nA/um) versus Temperature (C), 30-110 C, for the 0.25um, 0.18um, 0.13um, and 0.1um process generations]
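The exponential trend called out above is the standard subthreshold-leakage relation; as a hedged sketch (not taken from the slides), the off-current scales roughly as

```latex
I_{\mathrm{off}} \;\propto\; e^{-V_{th}/(n\,v_T)}, \qquad v_T = kT/q
```

so lowering the threshold voltage $V_{th}$ with each process generation raises leakage exponentially, and the thermal voltage $v_T$ grows with temperature, which is why the curves rise with both scaling and heat.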


Outline

 Cache Power Reduction Solutions
 Leakage Issue
 Possible Solutions
 Our Reuse Distance (RD) Policy
 Energy and Delay Performance
 Future Work

Cache Power Reduction

 Reduce Dynamic Power
   Partition caches horizontally via cache banking or region caches [lee+cases00]
   Partition the cache vertically using filter caches or line buffers [kamble+islped97, kin+ieeetc00]
 Reduce Static Power
   Utilize high-VT threshold transistors
   Dynamically turn off dead lines [kaxiras+isca01]
   Dynamically put unused lines to sleep [flautner+isca02]


Region Caches

 Partition the data cache into stack, global, and heap regions*
 Steer accesses to the cache structures using the virtual address*
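As a minimal sketch of the steering idea — the address boundaries (`STACK_BASE`, `GLOBAL_BASE`, `GLOBAL_END`) are made-up constants for illustration, not the deck's actual memory layout:

```python
# Hypothetical region boundaries; a real design derives these from the
# process's stack/global/heap layout.
STACK_BASE  = 0x7FF0_0000   # assumed: stack lives at the top of the space
GLOBAL_BASE = 0x1000_0000   # assumed: global/static data region start
GLOBAL_END  = 0x2000_0000   # assumed: global/static data region end

def steer(vaddr: int) -> str:
    """Pick the region cache for a data access by its virtual address."""
    if vaddr >= STACK_BASE:
        return "stack"
    if GLOBAL_BASE <= vaddr < GLOBAL_END:
        return "global"
    return "heap"               # everything else goes to the heap cache
```

Because the region is decided purely by address-range comparison, the steering logic adds almost no dynamic power relative to the cache probe itself.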

Multiple Access Caches

Target way-associative performance without power overhead:

 Column-associative caches check a secondary cache line on a miss, with an extra bit to indicate whether the tag line hashed
 MRU two-way associative caches check the cache ways sequentially rather than in parallel, with an extra bit for the MRU way
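A hedged sketch of the MRU two-way scheme described above; the class name and probe-counting interface are illustrative, not the paper's implementation:

```python
class MRUTwoWayCache:
    """Two-way set-associative cache probed sequentially: the MRU way is
    checked first, then the other way. One MRU bit per set (a behavioral
    sketch, not cycle-accurate hardware)."""

    def __init__(self, num_sets: int):
        self.tags = [[None, None] for _ in range(num_sets)]
        self.mru = [0] * num_sets            # which way to probe first

    def access(self, set_idx: int, tag: int) -> int:
        """Return the number of probes (1 = fast MRU hit, 2 = slow hit or miss)."""
        first = self.mru[set_idx]
        if self.tags[set_idx][first] == tag:
            return 1                         # hit on the first (MRU) probe
        other = 1 - first
        if self.tags[set_idx][other] == tag:
            self.mru[set_idx] = other        # flip the MRU bit
            return 2                         # hit on the second probe
        self.tags[set_idx][other] = tag      # miss: fill the non-MRU way
        self.mru[set_idx] = other
        return 2
```

Only one way is read per probe, so a fast hit costs roughly direct-mapped dynamic power while retaining two-way conflict behavior.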


Leakage Reduction

 High-VT Static Solution
   Replace transistors with high-VT ones
   Static increase in latency
 Gated-VDD Decay Caches (state-losing)
   Turn off unused cache lines (loses data)
   Requires sleeper transistors
 Adaptive Body Biasing (ABB) & Drowsy Caches (state-retaining)
   Significant delay and dynamic power consumption between wakeups for ABB
   Requires a special manufacturing process for ABB
   DVS for leakage reduction with drowsy caches
   Extra circuitry required for both

Previous Drowsy Leakage Policies

 Simple
   Turn off all cache lines every X cycles
   Little overhead; power/performance is variable
 No Access
   Turn off a cache line if not accessed within X cycles
   Counters required per cache line
 Reuse Most Recently On (RMRO)
   A No Access policy applied specifically to cache ways
   Requires some bits per cache set, but only one counter
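The Simple policy above can be sketched in a few lines; the function name and the one-access-per-cycle simplification are ours, for illustration only:

```python
def simple_policy(access_trace, window=4000):
    """Sketch of the Simple drowsy policy: every `window` cycles ALL lines
    are put drowsy; any access to a drowsy line then pays a wakeup penalty.
    Assumes one cache access per cycle to keep the model tiny."""
    awake = set()
    wakeups = 0
    for cycle, line in enumerate(access_trace):
        if cycle % window == 0:
            awake.clear()          # periodic global drowsy transition
        if line not in awake:
            wakeups += 1           # wake the drowsy line before the access
            awake.add(line)
    return wakeups
```

The appeal is the near-zero hardware cost (one global counter); the drawback, visible in the model, is that hot lines are re-woken every window regardless of their reuse.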


Reuse Distance (RD) Policy

 Measures time using cache accesses to increment counters
 Keeps only the last N accesses "awake" for an RD of size N
 Ensures only N lines are ever awake
 Clock-cycle independent
 Gives an upper bound for the power envelope
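A rough functional model of the RD policy as described above; the class name and bookkeeping are illustrative, not the hardware design:

```python
class ReuseDistanceDrowsy:
    """Functional sketch of the RD policy: only the N most recently
    accessed distinct lines stay awake; all others are drowsy."""

    def __init__(self, n: int):
        self.n = n               # RD window size (N lines awake at most)
        self.recency = []        # distinct lines, most recent first
        self.drowsy_hits = 0     # accesses that had to wake a drowsy line

    def access(self, line: int):
        awake = set(self.recency[:self.n])
        if line not in awake:
            self.drowsy_hits += 1        # line was drowsy: pay wakeup latency
        if line in self.recency:
            self.recency.remove(line)
        self.recency.insert(0, line)     # line becomes most recently used
        # conceptually, lines beyond position n are now put into drowsy mode
```

Because the window is counted in accesses rather than cycles, the number of awake lines (and hence the leakage bound) is fixed regardless of clock frequency, which is the policy's key property.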

Reuse Distance (RD) LRU

 True LRU is too expensive; substitute with:
   Quasi-LRU via saturating counters
   Close approximations via timestamps

[Figure: RD N=4 example — cache accesses check the per-line LRU counter bits; drowsy misses increment them]
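One plausible reading of the saturating-counter scheme, sketched below; the counter width (`SAT_MAX = 3`, i.e. 2 bits) and the reset-on-access behavior are assumptions for illustration, not taken from the slides:

```python
SAT_MAX = 3   # assumed 2-bit saturating counter per line

class QuasiLRU:
    """Quasi-LRU via per-line saturating counters: an access resets the
    touched line's counter to 0 and increments all others (saturating at
    SAT_MAX). Saturated lines are the coldest and go drowsy first."""

    def __init__(self, num_lines: int):
        self.ctr = [SAT_MAX] * num_lines   # start everything "cold"

    def access(self, line: int):
        for i in range(len(self.ctr)):
            if i == line:
                self.ctr[i] = 0            # most recently used
            elif self.ctr[i] < SAT_MAX:
                self.ctr[i] += 1           # age everyone else

    def drowsy_candidates(self):
        """Lines whose counter has saturated: approximately the LRU lines."""
        return [i for i, c in enumerate(self.ctr) if c == SAT_MAX]
```

With a few bits per line this orders lines only approximately, but that is sufficient here: the policy only needs to distinguish the N recently reused lines from everything else.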


We Apply

 Region caches with the heap cache size

reduced by half multiple access cache to reduced by half, multiple access cache to retain performance

 Drowsy cache using the RD policy  Target embedded architecture and

applications

Experimental Setup

 Alpha 21264 Architecture/ISA
 HotLeakage Simulator
   1.5 GHz, 70 nm, 80 degrees C
 SPEC2000 Benchmarks Using SimPoints
 2-Level Cache Hierarchy
   32 KB, 32-byte-line, 4-Way L1 D-Cache (1 cycle)
   4-Way Unified L2: 256KB / 512KB / 1MB / 2MB
 Drowsy Policies
   Simple Policy: 4K cycles (NoAccess omitted)
   RMRO: 256
   RD: 15


Column Associative MRU

[Figure-only slide; graphical content not captured in the text extraction]


Reuse Coverage Performance

[Figure: IPC normalized to the simple direct-mapped cache for the simple and RD policies on column-associative (CA) and MRU caches; all values fall roughly between 0.97 and 0.99]


Dynamic Energy

[Figure: dynamic power consumption normalized to the simple direct-mapped cache for simple 2-way associative, simple column-associative, and simple MRU caches]

Static Energy

[Figure: leakage normalized to the non-drowsy direct-mapped cache for the simple and RD policies, broken down by region (heap, stack, global)]


Total Power Consumption

[Figure: total power normalized to the non-drowsy direct-mapped cache for the simple and RD policies on DM, CA, and MRU caches]

Conclusion

 Cache Power Reductions
   Dynamic power reductions achieved via multiple access caches
   Significant leakage reduction through the RD policy
   Minimal performance degradation
 Future Work
   Investigate cache interaction in CMP systems
   Use compiler hints for static cache assignments


Q&A