Highly-Associative Caches for Low-Power Processors

Motivation

  • Cache uses 30-60% of processor energy in embedded systems.
      • Example: 43% for StrongARM-1
  • Many academic studies on caches
      • [Albera, Bahar, ’98] – Power and performance trade-offs
      • [Amrutur, Horowitz, ’98, ’00] – Speed and power scaling
      • [Bellas, Hajj, Polychronopoulos, ’99] – Dynamic cache management
      • [Ghose, Kamble, ’99] – Power reduction through sub-banking, etc.
      • [Inoue, Ishihara, Murakami, ’99] – Way-predicting set-associative cache
      • [Kin, Gupta, Mangione-Smith, ’97] – Filter cache
      • [Ko, Balsara, Nanda, ’98] – Multilevel caches for RISC and CISC
      • [Wilton, Jouppi, ’94] – CACTI cache model
  • Many industrial low-power processors use CAM (content-addressable memory)
      • ARM3 – 64-way set-associative – [Furber et al. ’89]
      • StrongARM – 32-way set-associative – [Santhanam et al. ’98]
      • Intel XScale – 32-way set-associative – ’01
  • CAM: fast and energy-efficient

Talk Outline

  • Structural Comparison
  • Area and Delay Comparison
  • Energy Comparison
  • Related Work
  • Conclusion

Set-Associative RAM-tag Cache

  • Not energy-efficient
      • All ways are read out
  • Two-phase approach
      • More energy-efficient
      • 2X latency

[Figure: set-associative RAM-tag cache organization. Labels: address fields Tag, Index, Offset; per-way arrays Tag, Status, Data.]
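The trade-off between the two lookup styles can be sketched by counting array reads. This is a minimal illustration, not the authors' energy model: the names `AccessCost`, `parallel_access`, and `two_phase_access` are hypothetical, and "phases" is only a rough stand-in for latency.

```python
from dataclasses import dataclass


@dataclass
class AccessCost:
    tag_reads: int   # tag arrays read during the access
    data_reads: int  # data arrays read during the access
    phases: int      # sequential array-access phases (a rough latency proxy)


def parallel_access(ways: int) -> AccessCost:
    """Read the tag and data arrays of every way at once, then select the hit way."""
    return AccessCost(tag_reads=ways, data_reads=ways, phases=1)


def two_phase_access(ways: int, hit: bool = True) -> AccessCost:
    """Read all tags first; read data from the matching way only on a hit."""
    if hit:
        return AccessCost(tag_reads=ways, data_reads=1, phases=2)
    # On a miss no data array is read, so the access ends after the tag phase.
    return AccessCost(tag_reads=ways, data_reads=0, phases=1)


if __name__ == "__main__":
    for ways in (2, 4, 8):
        print(ways, parallel_access(ways), two_phase_access(ways))
```

In this sketch the number of data-array reads grows with associativity for the parallel lookup but stays at one for the two-phase lookup, at the cost of a second sequential phase on every hit.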

Set-Associative RAM-tag Sub-bank

  • Not energy-efficient
      • All ways are read out
  • Two-phase approach
      • More energy-efficient
      • 2X latency
  • Sub-banking
  • 1 sub-bank = 1 way
  • Low-swing bitlines
      • Only for reads; writes are performed full-swing
  • Wordline gating

[Figure: RAM-tag cache sub-bank. Labels: I/O BUS, addr BUS, Cache, Address Decoder, gwl, lwl, Offset Dec, offset, Tag SRAM Cells, Tag Comp, Data SRAM Cells, Sense Amps.]

CAM-tag Cache

  • Only one sub-bank activated
  • Associativity within sub-bank
  • Easy to implement high associativity

[Figure: CAM-tag cache organization. Labels: address fields Tag, Bank, Offset; per-sub-bank arrays Tag, Status, Data; HIT? outputs; Word.]
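A CAM-tag lookup along the lines of the figure can be sketched as follows: the bank field of the address selects a single sub-bank, and the tag is compared against every entry in that sub-bank. The line size, bank count, and the `CamTagCache` class are assumptions made for illustration; the slides do not give these parameters.

```python
LINE_BYTES = 32  # assumed cache line size
NUM_BANKS = 8    # assumed number of sub-banks


def split_address(addr):
    """Split an address into (tag, bank, offset) fields."""
    offset = addr % LINE_BYTES
    bank = (addr // LINE_BYTES) % NUM_BANKS
    tag = addr // (LINE_BYTES * NUM_BANKS)
    return tag, bank, offset


class CamTagCache:
    def __init__(self, ways):
        # One list of stored tags per sub-bank; None marks an empty entry.
        self.banks = [[None] * ways for _ in range(NUM_BANKS)]
        self.banks_searched = 0

    def fill(self, addr, way):
        """Store the tag of addr in the given way of its sub-bank."""
        tag, bank, _ = split_address(addr)
        self.banks[bank][way] = tag

    def lookup(self, addr):
        """Search only the selected sub-bank; return the matching way or None."""
        tag, bank, _ = split_address(addr)
        self.banks_searched += 1  # exactly one sub-bank is activated per access
        for way, stored in enumerate(self.banks[bank]):
            if stored == tag:
                return way
        return None


if __name__ == "__main__":
    cache = CamTagCache(ways=32)
    cache.fill(0x1234, way=5)
    print(cache.lookup(0x1234))
```

The loop over ways stands in for the parallel CAM comparison; in hardware all entries of the selected sub-bank are compared at once.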

CAM-tag Cache Sub-bank

  • Only one sub-bank activated
  • Associativity within sub-bank
  • Easy to implement high associativity

[Figure: CAM-tag cache sub-bank. Labels: I/O BUS, tag, CAM tag array, gwl, lwl, Offset Dec, offset, SRAM Cells, Sense Amps.]

CAM Functionality and Energy Usage

[Figure: …T CAM cell with separate write/search lines and a low-swing match line, shown for the Match and Mismatch cases. Signals: WL, Bit, Bit_b, SBit, SBit_b, match. Cell blocks: SRAM, XOR.]

  • CAM energy dissipation
      • Search lines
      • Match lines
      • Drivers
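The match and mismatch cases can be illustrated with a small functional model. It assumes each cell XORs its stored bit with the broadcast search bit and that a row matches only if no cell reports a difference; the function names are hypothetical and the model ignores all circuit-level detail such as precharge and low-swing signalling.

```python
def cell_mismatch(stored_bit, search_bit):
    """XOR of the stored bit and the search bit: 1 means this cell mismatches."""
    return stored_bit ^ search_bit


def match_line(stored_tag, search_tag, width):
    """Return True if every bit of the stored tag equals the search tag."""
    for i in range(width):
        stored_bit = (stored_tag >> i) & 1
        search_bit = (search_tag >> i) & 1
        if cell_mismatch(stored_bit, search_bit):
            return False  # a single mismatching cell is enough to signal a mismatch
    return True


def search(cam_rows, search_tag, width):
    """Broadcast the search tag to all rows and collect one match result per row."""
    return [match_line(row, search_tag, width) for row in cam_rows]


if __name__ == "__main__":
    print(search([0b1010, 0b1011, 0b0010], 0b1010, width=4))
```

Every search drives the search lines across all rows and evaluates every match line, which is why search lines, match lines, and their drivers are the energy components listed above.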

CAM-tag Cache Sub-bank Layout

10% area overhead over RAM-tag cache

[Figure: layout of a …KB cache sub-bank implemented in …µm CMOS technology, showing the …x…x… CAM array and the …x… RAM array.]

Delay Comparison

[Figure: critical-path timing of the two designs.
RAM-tag cache critical path, stages shown: index bits, global wordline decoding (gwl), decoded offset, local wordline decoding (lwl), tag readout, data readout, tag bits, tag comparison, data out.
CAM-tag cache critical path, stages shown: tag bits broadcasting, tag bits, tag comparison, gwl, decoded offset, local wordline decoding (lwl), data readout, data out.]

Hit Energy Comparison

[Figure: bar chart of hit energy per access for a …KB cache in pJ (y-axis ticks 50 to 450) versus associativity and implementation (x-axis): 1-way RAM, 2-way RAM, 4-way RAM, 8-way RAM, 8-way CAM, 16-way CAM, 32-way CAM. Series: LZW, ijpeg, pegwit, perl, m88ksim, gcc, Avg.]

Miss Rate Results

[Figure: miss rate versus associativity (1-way, 2-way, 4-way, 8-way, 16-way, 32-way, 64-way) for 8KB and 16KB caches, one panel per benchmark: LZW, ijpeg, m88ksim, perl, pegwit, gcc.]
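How associativity affects miss rate can be explored with a small trace-driven model. The sketch below assumes LRU replacement and a 32-byte line, neither of which is stated on the slides, and the function name `miss_rate` is hypothetical.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, ways, line_bytes=32):
    """Miss rate of a set-associative cache with LRU replacement (assumed policy).

    addresses must be a non-empty sequence of byte addresses.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    # Each set maps tag -> True, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        current = sets[line % num_sets]
        tag = line // num_sets
        if tag in current:
            current.move_to_end(tag)  # mark as most recently used
        else:
            misses += 1
            if len(current) >= ways:
                current.popitem(last=False)  # evict the least recently used tag
            current[tag] = True
    return misses / len(addresses)


if __name__ == "__main__":
    # Two addresses that conflict in a 1 KB direct-mapped cache.
    trace = [0, 1024] * 50
    for ways in (1, 2, 4):
        print(ways, miss_rate(trace, cache_bytes=1024, ways=ways))
```

In this toy trace the two conflicting lines evict each other in the direct-mapped cache, while two or more ways let both stay resident after the first two compulsory misses.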

Total Access Energy (pegwit)

Pegwit: high miss rate for high associativity

[Figure: total energy per access for a …KB cache in pJ (y-axis ticks 500 to 2500) versus miss energy expressed in multiples of …-bit read access energy (x-axis: 32X, 64X, 128X, 256X, 512X, 1024X). Series: 1-RAM, 2-RAM, 4-RAM, 8-RAM, 8-CAM, 16-CAM, 32-CAM.]

Total Access Energy (perl)

Perl: very low miss rate for high associativity

[Figure: total energy per access for a …KB cache in pJ (y-axis ticks 50 to 500) versus miss energy expressed in multiples of …-bit read access energy (x-axis: 32X, 64X, 128X, 256X, 512X, 1024X). Series: 1-RAM, 2-RAM, 4-RAM, 8-RAM, 8-CAM, 16-CAM, 32-CAM.]
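Both plots sweep the miss energy as a multiple of a read access energy. One plausible way to compose such a total, assumed here rather than taken from the slides, is hit energy plus miss rate times miss energy. The function names and every numeric input in the sketch are illustrative assumptions, not measured results.

```python
MISS_MULTIPLES = (32, 64, 128, 256, 512, 1024)  # x-axis values from the plots


def total_energy_per_access(hit_energy_pj, miss_rate, miss_multiple, read_energy_pj):
    """Hit energy plus expected miss energy per access (assumed model)."""
    miss_energy_pj = miss_multiple * read_energy_pj
    return hit_energy_pj + miss_rate * miss_energy_pj


def sweep(hit_energy_pj, miss_rate, read_energy_pj):
    """Total energy per access at each miss-energy multiple."""
    return [
        total_energy_per_access(hit_energy_pj, miss_rate, multiple, read_energy_pj)
        for multiple in MISS_MULTIPLES
    ]


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the shape of the curves.
    low_miss = sweep(hit_energy_pj=100.0, miss_rate=0.001, read_energy_pj=50.0)
    high_miss = sweep(hit_energy_pj=100.0, miss_rate=0.02, read_energy_pj=50.0)
    for multiple, low, high in zip(MISS_MULTIPLES, low_miss, high_miss):
        print(f"{multiple}X: {low:.1f} pJ vs {high:.1f} pJ")
```

Under this model a configuration with a lower miss rate becomes more attractive as misses get more expensive, since the miss term grows linearly with the multiple.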

Other Advantages of CAM-tag

  • Hit signal generated earlier
      • Simplifies pipelines
  • Simplified store operation
      • Wordline only enabled during a hit
      • Stores can happen in a single cycle
      • No write buffer necessary

Related Work

  • CACTI and CACTI2
      • [Wilton and Jouppi, ’94], [Reinman and Jouppi, ’99]
      • Accurate delay and energy estimates
          • Results within 10%
      • Energy estimates not suited for low-power designs
      • Typical low-power features not included in CACTI
          • Sub-banking
          • Low-swing bitlines
          • Wordline gating
          • Separate CAM search lines
          • Low-swing match lines
      • Energy estimate 10X greater than our model for one CAM-tag cache sub-bank
          • Our results closely agree with [Amrutur and Horowitz, ’98]

Conclusion

CAM tags: high performance and low power

  • Energy consumption of 32-way CAM < 2-way RAM
  • Easy to implement highly-associative tags
  • Low area overhead (10%)
  • Comparable access delay
  • Better CPI by reducing miss rate

Thank You!