
4/26/2012

Leveraging High Performance Data Cache Techniques to Save Power in Embedded Systems

Major Bhadauria, Sally A. McKee, Karan Singh, Gary S. Tyson

Process Technology Leakage Problem

 Lower Operating Voltage
 Lower Transistor Threshold
 Exponential Increase in Leakage

[Figure: Leakage vs. Temperature — Ioff (nA/um) versus Temperature (C), 30-110 C, for the 0.25um, 0.18um, 0.13um, and 0.1um process generations]
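The exponential trend called out above is the standard subthreshold-leakage relation; as a hedged sketch (not taken from the slides), the off-current scales roughly as

```latex
I_{\mathrm{off}} \;\propto\; e^{-V_{th}/(n\,v_T)}, \qquad v_T = kT/q
```

so lowering the threshold voltage $V_{th}$ with each process generation raises leakage exponentially, and the thermal voltage $v_T$ grows with temperature, which is why the curves rise with both scaling and heat.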


Outline

 Cache Power Reduction Solutions
 Leakage Issue
 Possible Solutions
 Our Reuse Distance (RD) Policy
 Energy and Delay Performance
 Future Work

Cache Power Reduction

 Reduce Dynamic Power
   Partition caches horizontally via cache banking or region caches [lee+cases00]
   Partition the cache vertically using filter caches or line buffers [kamble+islped97, kin+ieeetc00]
 Reduce Static Power
   Utilize high-VT threshold transistors
   Dynamically turn off dead lines [kaxiras+isca01]
   Dynamically put unused lines to sleep [flautner+isca02]


Region Caches

 Partition the data cache into stack, global, and heap regions*
 Steer accesses to the cache structures using the virtual address*
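As a minimal sketch of the steering idea — the address boundaries (`STACK_BASE`, `GLOBAL_BASE`, `GLOBAL_END`) are made-up constants for illustration, not the deck's actual memory layout:

```python
# Hypothetical region boundaries; a real design derives these from the
# process's stack/global/heap layout.
STACK_BASE  = 0x7FF0_0000   # assumed: stack lives at the top of the space
GLOBAL_BASE = 0x1000_0000   # assumed: global/static data region start
GLOBAL_END  = 0x2000_0000   # assumed: global/static data region end

def steer(vaddr: int) -> str:
    """Pick the region cache for a data access by its virtual address."""
    if vaddr >= STACK_BASE:
        return "stack"
    if GLOBAL_BASE <= vaddr < GLOBAL_END:
        return "global"
    return "heap"               # everything else goes to the heap cache
```

Because the region is decided purely by address-range comparison, the steering logic adds almost no dynamic power relative to the cache probe itself.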

Multiple Access Caches

Target way-associative performance without power overhead:

 Column-associative caches check a secondary cache line on a miss, with an extra bit to indicate whether the tag line hashed
 MRU two-way associative caches check the cache ways sequentially rather than in parallel, with an extra bit for the MRU way
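A hedged sketch of the MRU two-way scheme described above; the class name and probe-counting interface are illustrative, not the paper's implementation:

```python
class MRUTwoWayCache:
    """Two-way set-associative cache probed sequentially: the MRU way is
    checked first, then the other way. One MRU bit per set (a behavioral
    sketch, not cycle-accurate hardware)."""

    def __init__(self, num_sets: int):
        self.tags = [[None, None] for _ in range(num_sets)]
        self.mru = [0] * num_sets            # which way to probe first

    def access(self, set_idx: int, tag: int) -> int:
        """Return the number of probes (1 = fast MRU hit, 2 = slow hit or miss)."""
        first = self.mru[set_idx]
        if self.tags[set_idx][first] == tag:
            return 1                         # hit on the first (MRU) probe
        other = 1 - first
        if self.tags[set_idx][other] == tag:
            self.mru[set_idx] = other        # flip the MRU bit
            return 2                         # hit on the second probe
        self.tags[set_idx][other] = tag      # miss: fill the non-MRU way
        self.mru[set_idx] = other
        return 2
```

Only one way is read per probe, so a fast hit costs roughly direct-mapped dynamic power while retaining two-way conflict behavior.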


Leakage Reduction

 High-VT Static Solution
   Replace transistors with high-VT ones
   Static increase in latency
 Gated-VDD Decay Caches (state-losing)
   Turn off unused cache lines (loses data)
   Requires sleeper transistors
 Adaptive Body Biasing (ABB) & Drowsy Caches (state-retaining)
   Significant delay and dynamic power consumption between wakeups for ABB
   Requires a special manufacturing process for ABB
   DVS for leakage reduction with drowsy caches
   Extra circuitry required for both

Previous Drowsy Leakage Policies

 Simple
   Turn off all cache lines every X cycles
   Little overhead; power/performance is variable
 No Access
   Turn off a cache line if not accessed within X cycles
   Counters required per cache line
 Reuse Most Recently On (RMRO)
   A No Access policy applied specifically to cache ways
   Requires some bits per cache set, but only one counter
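The Simple policy above can be sketched in a few lines; the function name and the one-access-per-cycle simplification are ours, for illustration only:

```python
def simple_policy(access_trace, window=4000):
    """Sketch of the Simple drowsy policy: every `window` cycles ALL lines
    are put drowsy; any access to a drowsy line then pays a wakeup penalty.
    Assumes one cache access per cycle to keep the model tiny."""
    awake = set()
    wakeups = 0
    for cycle, line in enumerate(access_trace):
        if cycle % window == 0:
            awake.clear()          # periodic global drowsy transition
        if line not in awake:
            wakeups += 1           # wake the drowsy line before the access
            awake.add(line)
    return wakeups
```

The appeal is the near-zero hardware cost (one global counter); the drawback, visible in the model, is that hot lines are re-woken every window regardless of their reuse.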


Reuse Distance (RD) Policy

 Measures time using cache accesses to increment counters
 Keeps only the last N accesses "awake" for an RD of size N
 Ensures only N lines are ever awake
 Clock-cycle independent
 Gives an upper bound for the power envelope
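A rough functional model of the RD policy as described above; the class name and bookkeeping are illustrative, not the hardware design:

```python
class ReuseDistanceDrowsy:
    """Functional sketch of the RD policy: only the N most recently
    accessed distinct lines stay awake; all others are drowsy."""

    def __init__(self, n: int):
        self.n = n               # RD window size (N lines awake at most)
        self.recency = []        # distinct lines, most recent first
        self.drowsy_hits = 0     # accesses that had to wake a drowsy line

    def access(self, line: int):
        awake = set(self.recency[:self.n])
        if line not in awake:
            self.drowsy_hits += 1        # line was drowsy: pay wakeup latency
        if line in self.recency:
            self.recency.remove(line)
        self.recency.insert(0, line)     # line becomes most recently used
        # conceptually, lines beyond position n are now put into drowsy mode
```

Because the window is counted in accesses rather than cycles, the number of awake lines (and hence the leakage bound) is fixed regardless of clock frequency, which is the policy's key property.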

Reuse Distance (RD) LRU

 True LRU is too expensive; substitute with:
   Quasi-LRU via saturating counters
   Close approximations via timestamps

[Figure: RD N=4 example — cache accesses check the per-line LRU counter bits; drowsy misses increment them]
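One plausible reading of the saturating-counter scheme, sketched below; the counter width (`SAT_MAX = 3`, i.e. 2 bits) and the reset-on-access behavior are assumptions for illustration, not taken from the slides:

```python
SAT_MAX = 3   # assumed 2-bit saturating counter per line

class QuasiLRU:
    """Quasi-LRU via per-line saturating counters: an access resets the
    touched line's counter to 0 and increments all others (saturating at
    SAT_MAX). Saturated lines are the coldest and go drowsy first."""

    def __init__(self, num_lines: int):
        self.ctr = [SAT_MAX] * num_lines   # start everything "cold"

    def access(self, line: int):
        for i in range(len(self.ctr)):
            if i == line:
                self.ctr[i] = 0            # most recently used
            elif self.ctr[i] < SAT_MAX:
                self.ctr[i] += 1           # age everyone else

    def drowsy_candidates(self):
        """Lines whose counter has saturated: approximately the LRU lines."""
        return [i for i, c in enumerate(self.ctr) if c == SAT_MAX]
```

With a few bits per line this orders lines only approximately, but that is sufficient here: the policy only needs to distinguish the N recently reused lines from everything else.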


We Apply

 Region caches with the heap cache size

reduced by half multiple access cache to reduced by half, multiple access cache to retain performance

 Drowsy cache using the RD policy  Target embedded architecture and

applications

Experimental Setup

 Alpha 21264 Architecture/ISA
 HotLeakage Simulator
   1.5 GHz, 70 nm, 80 degrees C
 SPEC2000 Benchmarks Using SimPoints
 2-Level Cache Hierarchy
   32 KB, 32-byte-line, 4-Way L1 D-Cache (1 cycle)
   4-Way Unified L2: 256KB / 512KB / 1MB / 2MB
 Drowsy Policies
   Simple Policy: 4K cycles (NoAccess omitted)
   RMRO: 256
   RD: 15


Column Associative MRU

[Figure-only slide; graphical content not captured in the text extraction]


Reuse Coverage Performance

[Figure: IPC normalized to the simple direct-mapped cache for the simple and RD policies on column-associative (CA) and MRU caches; all values fall roughly between 0.97 and 0.99]


Dynamic Energy

[Figure: dynamic power consumption normalized to the simple direct-mapped cache for simple 2-way associative, simple column-associative, and simple MRU caches]

Static Energy

[Figure: leakage normalized to the non-drowsy direct-mapped cache for the simple and RD policies, broken down by region (heap, stack, global)]


Total Power Consumption

[Figure: total power normalized to the non-drowsy direct-mapped cache for the simple and RD policies on DM, CA, and MRU caches]

Conclusion

 Cache Power Reductions
   Dynamic power reductions achieved via multiple access caches
   Significant leakage reduction through the RD policy
   Minimal performance degradation
 Future Work
   Investigate cache interaction in CMP systems
   Use compiler hints for static cache assignments


Q&A