Gated Precharging: Reducing Bitline Precharge in Deep-Subµ Caches - PowerPoint PPT Presentation


SLIDE 1

Gated Precharging: Reducing Bitline Precharge in Deep-Subµ Caches

Se-Hyun Yang and Babak Falsafi

PowerTap

http://www.ece.cmu.edu/~powertap
Computer Architecture Lab (CALCM)
Carnegie Mellon University

SLIDE 2

High Bitline Leakage in Caches

Deep-subµ high-performance caches

Use subarrays
Precharge entire cache
Active subarrays: bitlines discharge
Idle subarrays: bitlines leak

Energy wasted in idle subarrays!

[Diagram: bitlines (BL) across cache subarrays]

SLIDE 3

Exploit Temporal Locality in Subarrays

Observation

All subarrays precharge and leak
But only a small number of subarrays are active

[Diagram: many precharged subarrays, with only a few "hot" ones]

SLIDE 4

Contribution: Gated Precharging

Precharge only active subarrays
Detect temporal locality:

Decay counters
Threshold comparison logic

Reduce precharging:

by 89% in L1 d-cache
by 92% in L1 i-cache
with < 2% performance impact

SLIDE 5

Outline

Overview

Bitline Leakage

Gated Precharging

Temporal Locality in Subarrays
Implementation
Gating Overhead

Related Work
Results
Conclusion

SLIDE 6

Bitline Leakage in SRAM Cells

More than 60% discharge in 0.10µ

[Diagram: SRAM cell with wordline and bitlines (BL)]

SLIDE 7

How Much Temporal Locality?

We evaluate, in a small window:

1. How many accesses reuse subarrays?
2. How many active subarrays?

SLIDE 8

Subarray Reuse Ratio

Even in a small window, high subarray reuse
e.g., gcc with a 32K L1D and 1K subarrays

96% accesses reuse subarrays in 100-cycle window

For all benchmarks, in 100-cycle window

95% for d-cache
98% for i-cache

[Plot: cumulative fraction of d-cache accesses vs. subarray access interval (cycles)]
SLIDE 9

Fraction of Subarrays Accessed

In a small window, only a small number of active subarrays
e.g., gcc with a 32K L1D and 1K subarrays

19% of subarrays accessed in 100-cycle window

For all benchmarks, in 100-cycle window

< 29% for d-cache
< 22% for i-cache

[Plot: fraction of subarrays touched in a window vs. window size (cycles)]

SLIDE 10

Temporal Locality in Subarrays

In a 100-cycle window,

>95% of cache accesses reuse < 30% of subarrays

Most accesses are temporally localized in a small number of subarrays
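The two window metrics above (how many accesses reuse a subarray, and what fraction of subarrays a window touches) can be measured by replaying an access trace. The following Python sketch is illustrative only, not the paper's methodology; the trace format (cycle, subarray-index pairs) and the fixed-window accounting are assumptions.

```python
def subarray_locality(trace, num_subarrays=1024, window=100):
    """Replay a trace of (cycle, subarray) accesses and report:
    - reuse ratio: fraction of accesses whose subarray was last
      accessed less than `window` cycles earlier, and
    - average fraction of subarrays touched per `window` cycles.
    Illustrative sketch; trace format is an assumption."""
    last_access = {}            # subarray index -> cycle of last access
    reused = 0
    touched_per_window = []     # per-window fraction of subarrays touched
    current = set()             # subarrays touched in the open window
    window_start = 0
    for cycle, sub in trace:
        # Close any windows that ended before this access.
        while cycle >= window_start + window:
            touched_per_window.append(len(current) / num_subarrays)
            current = set()
            window_start += window
        current.add(sub)
        if sub in last_access and cycle - last_access[sub] < window:
            reused += 1
        last_access[sub] = cycle
    touched_per_window.append(len(current) / num_subarrays)
    reuse_ratio = reused / len(trace)
    avg_touched = sum(touched_per_window) / len(touched_per_window)
    return reuse_ratio, avg_touched
```

On a synthetic trace that cycles through 4 of 1024 subarrays, this reports 98% reuse and under 0.4% of subarrays touched per 100-cycle window: the kind of skew the slides report for gcc.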

SLIDE 11

Gated Precharging: Hardware

Decay counter per subarray [Kaxiras, et al.]
Threshold value to decide "when" to precharge

Algorithm:

if count < threshold: precharge
if count > threshold: no precharge

[Diagram: precharge control with a per-subarray decay counter (reset on cache access, clocked by CLK) compared against the threshold]
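The counter/comparator scheme on this slide can be sketched behaviorally. This Python model is a sketch under assumptions, not the paper's circuit: each subarray's decay counter resets on access and increments every clock, and a subarray is precharged only while its counter is below the threshold (count equal to the threshold is treated as gated here, a case the slide leaves unspecified). An access that finds its subarray gated pays the extra precharge cycle described under overhead.

```python
class GatedPrecharge:
    """Behavioral sketch of per-subarray decay-counter gating
    (an illustration, not the paper's hardware)."""

    def __init__(self, num_subarrays, threshold=100, counter_bits=10):
        self.threshold = threshold
        self.max_count = (1 << counter_bits) - 1  # 10-bit saturating counter
        self.count = [0] * num_subarrays          # all active at reset

    def access(self, subarray):
        """Access a subarray; returns True if it was still precharged.
        False means the subarray had decayed: it must be precharged
        first, costing 1 extra cycle."""
        was_active = self.count[subarray] < self.threshold
        self.count[subarray] = 0                  # reset decay counter
        return was_active

    def tick(self):
        """One clock edge: all decay counters increment (saturating)."""
        self.count = [min(c + 1, self.max_count) for c in self.count]

    def precharged(self):
        """Indices of subarrays currently being precharged this cycle."""
        return [i for i, c in enumerate(self.count) if c < self.threshold]
```

A subarray left untouched for `threshold` ticks drops out of `precharged()` and stops burning precharge energy until its next access.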

SLIDE 12

Gated Precharging: Overhead

Minimal performance overhead

Hits on idle subarrays incur 1 extra cycle
Infrequent due to temporal locality
(Example: gcc, < 8% of d-cache accesses)

Minimal energy overhead

10-bit counter per subarray
Comparison logic
Existing precharge control logic

SLIDE 13

Related Work

Delayed precharging [Alpha 21264]

Precharge only required subarrays
Increases cache access latency by delaying precharge

Resizable caches [Albonesi] [Yang, et al.]

Capture working-set size variation & resize caches
Coarse switching granularity (time & space)
Relatively larger performance overhead

Way prediction [Powell, et al.] [Inoue, et al.]

Predict the set-associative way for the next access
Orthogonal to gated precharging

SLIDE 14

Methodology

Wattch [ISCA2000]

16 SPEC2000/Olden benchmarks
Performance impact < 2%

Base Case

8-wide issue, 128-entry RUU
32K direct-mapped L1 I & D with 1K subarrays
512K 4-way unified L2

Determine threshold values based on profiling

Threshold values ≅ 100 cycles
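The slides say thresholds are determined by profiling but do not give the procedure, so the following is a hypothetical sketch of how such a pass might look: replay a profiling trace under several candidate thresholds and pick the smallest one whose rate of hits to decayed (gated) subarrays stays within the performance budget. The candidate list, the 2% budget, and the decision to count first-touch accesses as gated are all assumptions.

```python
def pick_threshold(trace, candidates=(25, 50, 100, 200, 400),
                   max_idle_hit_rate=0.02):
    """Hypothetical profiling pass (not the paper's procedure):
    choose the smallest candidate threshold whose fraction of
    accesses landing on a gated subarray (1 extra cycle each)
    stays within `max_idle_hit_rate`."""
    for threshold in sorted(candidates):
        last = {}
        idle_hits = 0
        for cycle, sub in trace:
            # A subarray is gated if never touched yet (assumption) or
            # if its decay counter would have reached the threshold.
            if sub not in last or cycle - last[sub] >= threshold:
                idle_hits += 1
            last[sub] = cycle
        if idle_hits / len(trace) <= max_idle_hit_rate:
            return threshold
    return max(candidates)  # fall back to the laziest gating
```

Smaller thresholds gate subarrays sooner and save more precharge energy, at the cost of more extra-cycle hits; the sweep makes that trade-off explicit.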

SLIDE 15

Results: D-Cache

Reduced:

on average by 89%
by >85% for all benchmarks but vpr

[Bar chart: fraction of subarray precharges (0–50%) per benchmark: ammp, art, bh, bisort, bzip2, em3d, equake, gcc, health, mcf, mesa, treeadd, tsp, vortex, vpr, wupwise]

SLIDE 16

Results: I-Cache

Reduced:

on average by 92%
by >90% for 13 benchmarks

[Bar chart: fraction of subarray precharges (0–50%) per benchmark: ammp, art, bh, bisort, bzip2, em3d, equake, gcc, health, mcf, mesa, treeadd, tsp, vortex, vpr, wupwise]

SLIDE 17

Conclusions

High bitline leakage in deep-submicron caches
Energy wasted in idle subarrays

Gated precharging:

Exploits temporal locality in subarrays
Reduces precharging by ~90%
With < 2% performance impact

SLIDE 18

For more information

PowerTap Project
http://www.ece.cmu.edu/~powertap
Computer Architecture Lab
Carnegie Mellon University