

SLIDE 1

CACHE POLICIES AND INTERCONNECTS

CS/ECE 7810: Advanced Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor, School of Computing, University of Utah

SLIDE 2

Overview

- Upcoming deadline
  - Feb. 3rd: project group formation
  - Note: email me once you form a group

- This lecture
  - Cache replacement policies
  - Cache partitioning
  - Content-aware optimizations
  - Cache interconnect optimizations
  - Encoding-based optimizations

SLIDE 3

Recall: Cache Power Optimization

- Caches are power- and performance-critical components
- Performance
  - Bridging the CPU-memory gap
- Static power
  - Large number of leaky cells
- Dynamic power
  - Access through long interconnects

Example: AMD FX processors [source: AMD]

SLIDE 4

Replacement Policies

SLIDE 5

Basic Replacement Policies

- Least Recently Used (LRU)
- Least Frequently Used (LFU)
- Not Recently Used (NRU)
  - every block has a bit that is reset to 0 upon touch
  - a block with its bit set to 1 is evicted
  - if no block has a 1, set every bit to 1
- Practical pseudo-LRU

(Example on slide: the access sequence A, A, B, X under LRU, LFU, MRU, and pseudo-LRU.)
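A minimal sketch of NRU for one cache set, in C; the 8-way set size and the way-indexed bit array are illustrative assumptions:

    #define WAYS 8

    /* One reference bit per way; 1 means "not recently used". */
    static unsigned char nru_bit[WAYS];

    /* On a hit: clear the touched block's bit. */
    void nru_touch(int way) { nru_bit[way] = 0; }

    /* On a miss: evict the first block whose bit is 1;
     * if no block has a 1, set every bit to 1 and retry. */
    int nru_victim(void) {
        for (;;) {
            for (int w = 0; w < WAYS; w++)
                if (nru_bit[w]) return w;
            for (int w = 0; w < WAYS; w++)
                nru_bit[w] = 1;
        }
    }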

SLIDE 6

Common Issues with Basic Policies

- Low hit rate due to cache pollution
  - streaming (no reuse): A-B-C-D-E-F-G-H-I-...
  - thrashing (distant reuse): A-B-C-A-B-C-A-B-C-...
- A large fraction of the cache is useless: blocks that have serviced their last hit and are on the slow walk from MRU to LRU

SLIDE 7

Basic Cache Policies

- Insertion
  - Where is the incoming line placed in the replacement list?
- Promotion
  - When a block is touched, it can be promoted up the priority list in one of many ways.
- Victim selection
  - Which line is replaced for the incoming line? (not necessarily the tail of the list)

Simple changes to these policies can greatly improve cache performance for memory-intensive workloads
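A minimal sketch of these three decision points as a C interface; the struct and names are illustrative, not from any particular paper:

    /* A replacement policy is fully described by how it inserts,
     * promotes, and selects victims; swapping these functions
     * changes the policy without touching the rest of the cache. */
    typedef struct {
        void (*insert)(int set, int way);   /* where a fill enters the list */
        void (*promote)(int set, int way);  /* how far a hit moves a block up */
        int  (*select_victim)(int set);     /* which way is replaced */
    } cache_policy;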

SLIDE 8

Inefficiency of Basic Policies

- About 60% of the cache blocks may be dead on arrival (DoA)

[Qureshi’07]

SLIDE 9

Adaptive Insertion Policies

- MIP: MRU insertion policy (baseline)
- LIP: LRU insertion policy

[Qureshi’07]

Recency stack, MRU → LRU: a b c d e f g h

Traditional LRU places incoming block 'i' in the MRU position: i a b c d e f g

LIP places 'i' in the LRU position: a b c d e f g i; with its first touch it becomes MRU.

SLIDE 10

Adaptive Insertion Policies

- LIP does not age older blocks
  - e.g., A, A, B, C, B, C, B, C, ...
- BIP: Bimodal Insertion Policy [Qureshi’07]
  - Let e = bimodal throttle parameter

    if (rand() < e)
        insert at MRU position;
    else
        insert at LRU position;

SLIDE 11

Adaptive Insertion Policies

- There are two types of workloads: LRU-friendly or BIP-friendly
- DIP: Dynamic Insertion Policy
  - Set Dueling [Qureshi’07]: a few dedicated LRU sets and a few dedicated BIP sets train a single n-bit counter, incremented on a miss in the LRU sets and decremented on a miss in the BIP sets; if the counter's MSB = 0, the follower sets use LRU, otherwise they use BIP
  - monitor → choose → apply (using a single counter)

Read the paper for more details.
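A minimal sketch of set dueling in C; the 10-bit counter width and the leader-set mapping are illustrative assumptions (the paper uses a more careful set selection):

    #include <stdbool.h>

    #define PSEL_MAX 1023                 /* 10-bit saturating counter */

    static int psel = PSEL_MAX / 2;       /* policy-selection counter */

    /* Dedicate a few leader sets to each policy. */
    static bool is_lru_leader(int set) { return set % 64 == 0; }
    static bool is_bip_leader(int set) { return set % 64 == 1; }

    /* Leader sets train the counter on their misses. */
    void dip_on_miss(int set) {
        if (is_lru_leader(set) && psel < PSEL_MAX) psel++;
        else if (is_bip_leader(set) && psel > 0)   psel--;
    }

    /* Follower sets use the policy with fewer misses: MSB clear
     * means LRU is winning, MSB set means BIP is winning. */
    bool use_bip(int set) {
        if (is_lru_leader(set)) return false;
        if (is_bip_leader(set)) return true;
        return psel > PSEL_MAX / 2;       /* MSB of the counter */
    }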

SLIDE 12

Adaptive Insertion Policies

- DIP reduces average MPKI by 21% and requires less than two bytes of storage overhead

[Qureshi’07]

SLIDE 13

Re-Reference Interval Prediction

- Goal: a high-performing, scan-resistant policy
  - DIP is thrash-resistant
  - LFU is good for recurring scans
- Key idea: insert blocks near the end of the list rather than at the very end
- Implement with a multi-bit version of NRU
  - zero a block's counter on touch; evict a block whose counter is at the maximum; if there is none, increment every counter by one and retry

[Jaleel’10] Read the paper for more details.
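A minimal sketch of this multi-bit NRU mechanism in C; the 2-bit counters and 16-way set are illustrative assumptions:

    #define WAYS     16
    #define RRPV_MAX 3                /* 2-bit counter per block */

    static unsigned char rrpv[WAYS];  /* re-reference prediction values */

    /* On a touch: zero the block's counter. */
    void rrip_touch(int way) { rrpv[way] = 0; }

    /* On insertion: place the block near, but not at, the end
     * of the list. */
    void rrip_insert(int way) { rrpv[way] = RRPV_MAX - 1; }

    /* On a miss: evict a block whose counter is at the maximum;
     * if none exists, increment every counter and retry. */
    int rrip_victim(void) {
        for (;;) {
            for (int w = 0; w < WAYS; w++)
                if (rrpv[w] == RRPV_MAX) return w;
            for (int w = 0; w < WAYS; w++)
                rrpv[w]++;
        }
    }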

SLIDE 14

Shared Cache Problems

- A thread's performance may be significantly reduced due to unfair cache sharing
- Question: how to control cache sharing?
  - Fair cache partitioning [Kim’04]
  - Utility-based cache partitioning [Qureshi’06]

(Figure: Core 1 and Core 2 contending for a shared cache.)

SLIDE 15

Utility Based Cache Partitioning

- Key idea: give more cache to the application that benefits more from cache

[Qureshi’06]

(Figure: misses per 1000 instructions (MPKI) for equake and vpr under LRU vs. utility-based partitioning.)

SLIDE 16

Utility Based Cache Partitioning

Three components:
- Utility Monitors (UMON) per core
- Partitioning Algorithm (PA)
- Replacement support to enforce partitions

(Figure: Core1 and Core2, each with private I$/D$ and a UMON, share an L2 cache backed by main memory; UMON1 and UMON2 feed the PA.)

[Qureshi’06]
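A minimal sketch of a partitioning pass in C, assuming each UMON supplies a utility curve misses[c][w] (misses core c would incur with w ways). The paper's lookahead algorithm also handles non-convex curves; this greedy loop is a simplification:

    #define CORES 2
    #define WAYS  16

    /* Hand out ways one at a time, each to the core whose miss
     * count drops the most from receiving one more way. */
    void partition(const long misses[CORES][WAYS + 1], int alloc[CORES]) {
        for (int c = 0; c < CORES; c++) alloc[c] = 0;
        for (int w = 0; w < WAYS; w++) {
            int best = 0;
            long best_gain = -1;
            for (int c = 0; c < CORES; c++) {
                long gain = misses[c][alloc[c]] - misses[c][alloc[c] + 1];
                if (gain > best_gain) { best_gain = gain; best = c; }
            }
            alloc[best]++;
        }
    }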

SLIDE 17

Highly Associative Caches

- Last-level caches have ~32 ways in multicores
  - Increased energy, latency, and area overheads [Sanchez’10]

SLIDE 18

Recall: Victim Caches

- Goal: decrease conflict misses using a small fully associative (FA) cache

(Figure: a 4-way set-associative last-level cache backed by a small FA victim cache.)

Can we reduce the hardware overheads?

SLIDE 19

The ZCache

- Goal: design a highly associative cache with a low number of ways
- Improves associativity by increasing the number of replacement candidates
- Retains the low energy per hit, latency, and area of caches with few ways
- Skewed-associative cache: each way has a different indexing function (in essence, W direct-mapped caches)

[Sanchez’10]

SLIDE 20

The ZCache

- When block A is brought in, it could replace one of four (say) blocks B, C, D, E; but B could be made to reside in one of three other locations (currently occupied by F, G, H); and F could in turn be moved to one of three other locations

[Sanchez’10] Read the paper for more details.
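A minimal sketch of per-way skewed indexing in C, the building block behind this relocation tree; the mixing function is an illustrative stand-in, not the paper's hash family:

    #include <stdint.h>

    #define WAYS     4
    #define SET_MASK ((1u << 10) - 1)     /* 1024 sets per way */

    /* Each way hashes the block address differently, so blocks
     * that conflict in one way map to unrelated locations in the
     * others; those locations are exactly the extra replacement
     * candidates walked when block A arrives. */
    uint32_t way_index(uint64_t addr, int way) {
        uint64_t x = addr ^ ((uint64_t)(way + 1) * 0x9e3779b97f4a7c15ull);
        x ^= x >> 23;
        x *= 0x2545f4914f6cdd1dull;
        return (uint32_t)(x >> 32) & SET_MASK;
    }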

SLIDE 21

Content Aware Optimizations

SLIDE 22

Dynamic Zero Compression

- More than 70% of the bits in data cache accesses are 0s

(Figure: example of a small cache: an address decoder drives global and local wordlines; offset decoders select within the data SRAM arrays, which feed sense amps; a tag SRAM array feeds the tag comparator.) [Villa’00]

SLIDE 23

Dynamic Zero Compression

- Zero Indicator Bit (ZIB): one bit per grouping of bits; set if the group's bits are all zeros; controls wordline gating

(Figure: alongside the address-controlled local wordline path, the ZIB adds a data-controlled gate, so all-zero groups skip the SRAM cells and sense amps.) [Villa’00]
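A minimal software sketch of the ZIB bookkeeping in C, assuming an illustrative one indicator bit per byte of a 32-byte line; in hardware the indicator gates the local wordline rather than taking a branch:

    #include <stdint.h>

    #define LINE_BYTES 32

    typedef struct {
        uint8_t  data[LINE_BYTES];
        uint32_t zib;             /* bit i set => byte i is all zeros */
    } dzc_line;

    /* On a write: record which byte groups are zero. */
    void dzc_write(dzc_line *l, const uint8_t *src) {
        l->zib = 0;
        for (int i = 0; i < LINE_BYTES; i++) {
            l->data[i] = src[i];
            if (src[i] == 0)
                l->zib |= 1u << i;
        }
    }

    /* On a read: a byte flagged in the ZIB never touches the data
     * array (its wordline stays gated), saving bitline energy. */
    uint8_t dzc_read(const dzc_line *l, int i) {
        return (l->zib & (1u << i)) ? 0 : l->data[i];
    }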

SLIDE 24

Dynamic Zero Compression

- Data cache bitline swing reduction

(Figure: percent bitline swing reduction for comp, li, ijpeg, go, vortex, m88k, gcc, perl, adpcm_en, adpcm_de, epic, unepic, g721_en, g721_de, mpeg_en, mpeg_de, pegwit_en, pegwit_de, and the average, at word, half-word, byte, and half-byte ZIB granularities.) [Villa’00]

SLIDE 25

Dynamic Zero Compression

- Data cache energy savings

(Figure: percent data cache energy savings across the same benchmarks: comp, li, ijpeg, go, vortex, m88k, gcc, perl, adpcm_en, adpcm_de, epic, unepic, g721_en, g721_de, mpeg_en, mpeg_de, pegwit_en, pegwit_de, and the average.) [Villa’00]