

SLIDE 1

CACHE POLICIES AND INTERCONNECTS

CS/ECE 7810: Advanced Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor, School of Computing, University of Utah

SLIDE 2

Overview

- Upcoming deadline
  - Feb. 3rd: project group formation
  - Note: email me once you form a group

- This lecture
  - Cache replacement policies
  - Cache partitioning
  - Content-aware optimizations
  - Cache interconnect optimizations
  - Encoding-based optimizations

SLIDE 3

Recall: Cache Power Optimization

- Caches are power- and performance-critical components
- Performance
  - Bridging the CPU-memory gap
- Static power
  - Large number of leaky cells
- Dynamic power
  - Access through long interconnects

Example: AMD FX processors [source: AMD]

SLIDE 4

Replacement Policies

SLIDE 5

Basic Replacement Policies

- Least Recently Used (LRU)
- Least Frequently Used (LFU)
- Not Recently Used (NRU)
  - every block has a bit that is reset to 0 upon touch
  - a block with its bit set to 1 is evicted
  - if no block has a 1, set every bit to 1
- Practical pseudo-LRU

(Example on slide: the access sequence A, A, B, X under LRU, LFU, MRU, and pseudo-LRU.)
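A minimal sketch of NRU for one cache set, in C; the 8-way set size and the way-indexed bit array are illustrative assumptions:

    #define WAYS 8

    /* One reference bit per way; 1 means "not recently used". */
    static unsigned char nru_bit[WAYS];

    /* On a hit: clear the touched block's bit. */
    void nru_touch(int way) { nru_bit[way] = 0; }

    /* On a miss: evict the first block whose bit is 1;
     * if no block has a 1, set every bit to 1 and retry. */
    int nru_victim(void) {
        for (;;) {
            for (int w = 0; w < WAYS; w++)
                if (nru_bit[w]) return w;
            for (int w = 0; w < WAYS; w++)
                nru_bit[w] = 1;
        }
    }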

SLIDE 6

Common Issues with Basic Policies

- Low hit rate due to cache pollution
  - streaming (no reuse): A-B-C-D-E-F-G-H-I-...
  - thrashing (distant reuse): A-B-C-A-B-C-A-B-C-...
- A large fraction of the cache is useless: blocks that have serviced their last hit and are on the slow walk from MRU to LRU

SLIDE 7

Basic Cache Policies

- Insertion
  - Where is the incoming line placed in the replacement list?
- Promotion
  - When a block is touched, it can be promoted up the priority list in one of many ways.
- Victim selection
  - Which line is replaced for the incoming line? (not necessarily the tail of the list)

Simple changes to these policies can greatly improve cache performance for memory-intensive workloads
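A minimal sketch of these three decision points as a C interface; the struct and names are illustrative, not from any particular paper:

    /* A replacement policy is fully described by how it inserts,
     * promotes, and selects victims; swapping these functions
     * changes the policy without touching the rest of the cache. */
    typedef struct {
        void (*insert)(int set, int way);   /* where a fill enters the list */
        void (*promote)(int set, int way);  /* how far a hit moves a block up */
        int  (*select_victim)(int set);     /* which way is replaced */
    } cache_policy;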

SLIDE 8

Inefficiency of Basic Policies

- About 60% of the cache blocks may be dead on arrival (DoA)

[Qureshi’07]

SLIDE 9

Adaptive Insertion Policies

- MIP: MRU insertion policy (baseline)
- LIP: LRU insertion policy

[Qureshi’07]

Recency stack, MRU → LRU: a b c d e f g h

Traditional LRU places incoming block 'i' in the MRU position: i a b c d e f g

LIP places 'i' in the LRU position: a b c d e f g i; with its first touch it becomes MRU.

SLIDE 10

Adaptive Insertion Policies

- LIP does not age older blocks
  - e.g., A, A, B, C, B, C, B, C, ...
- BIP: Bimodal Insertion Policy [Qureshi’07]
  - Let e = bimodal throttle parameter

    if (rand() < e)
        insert at MRU position;
    else
        insert at LRU position;

SLIDE 11

Adaptive Insertion Policies

- There are two types of workloads: LRU-friendly or BIP-friendly
- DIP: Dynamic Insertion Policy
  - Set Dueling [Qureshi’07]: a few dedicated LRU sets and a few dedicated BIP sets train a single n-bit counter, incremented on a miss in the LRU sets and decremented on a miss in the BIP sets; if the counter's MSB = 0, the follower sets use LRU, otherwise they use BIP
  - monitor → choose → apply (using a single counter)

Read the paper for more details.
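A minimal sketch of set dueling in C; the 10-bit counter width and the leader-set mapping are illustrative assumptions (the paper uses a more careful set selection):

    #include <stdbool.h>

    #define PSEL_MAX 1023                 /* 10-bit saturating counter */

    static int psel = PSEL_MAX / 2;       /* policy-selection counter */

    /* Dedicate a few leader sets to each policy. */
    static bool is_lru_leader(int set) { return set % 64 == 0; }
    static bool is_bip_leader(int set) { return set % 64 == 1; }

    /* Leader sets train the counter on their misses. */
    void dip_on_miss(int set) {
        if (is_lru_leader(set) && psel < PSEL_MAX) psel++;
        else if (is_bip_leader(set) && psel > 0)   psel--;
    }

    /* Follower sets use the policy with fewer misses: MSB clear
     * means LRU is winning, MSB set means BIP is winning. */
    bool use_bip(int set) {
        if (is_lru_leader(set)) return false;
        if (is_bip_leader(set)) return true;
        return psel > PSEL_MAX / 2;       /* MSB of the counter */
    }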

SLIDE 12

Adaptive Insertion Policies

- DIP reduces average MPKI by 21% and requires less than two bytes of storage overhead

[Qureshi’07]

SLIDE 13

Re-Reference Interval Prediction

- Goal: a high-performing, scan-resistant policy
  - DIP is thrash-resistant
  - LFU is good for recurring scans
- Key idea: insert blocks near the end of the list rather than at the very end
- Implement with a multi-bit version of NRU
  - zero a block's counter on touch; evict a block whose counter is at the maximum; if there is none, increment every counter by one and retry

[Jaleel’10] Read the paper for more details.
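A minimal sketch of this multi-bit NRU mechanism in C; the 2-bit counters and 16-way set are illustrative assumptions:

    #define WAYS     16
    #define RRPV_MAX 3                /* 2-bit counter per block */

    static unsigned char rrpv[WAYS];  /* re-reference prediction values */

    /* On a touch: zero the block's counter. */
    void rrip_touch(int way) { rrpv[way] = 0; }

    /* On insertion: place the block near, but not at, the end
     * of the list. */
    void rrip_insert(int way) { rrpv[way] = RRPV_MAX - 1; }

    /* On a miss: evict a block whose counter is at the maximum;
     * if none exists, increment every counter and retry. */
    int rrip_victim(void) {
        for (;;) {
            for (int w = 0; w < WAYS; w++)
                if (rrpv[w] == RRPV_MAX) return w;
            for (int w = 0; w < WAYS; w++)
                rrpv[w]++;
        }
    }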

SLIDE 14

Shared Cache Problems

- A thread's performance may be significantly reduced due to unfair cache sharing
- Question: how to control cache sharing?
  - Fair cache partitioning [Kim’04]
  - Utility-based cache partitioning [Qureshi’06]

(Figure: Core 1 and Core 2 contending for a shared cache.)

SLIDE 15

Utility Based Cache Partitioning

- Key idea: give more cache to the application that benefits more from cache

[Qureshi’06]

(Figure: misses per 1000 instructions (MPKI) for equake and vpr under LRU vs. utility-based partitioning.)

SLIDE 16

Utility Based Cache Partitioning

Three components:
- Utility Monitors (UMON) per core
- Partitioning Algorithm (PA)
- Replacement support to enforce partitions

(Figure: Core1 and Core2, each with private I$/D$ and a UMON, share an L2 cache backed by main memory; UMON1 and UMON2 feed the PA.)

[Qureshi’06]
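A minimal sketch of a partitioning pass in C, assuming each UMON supplies a utility curve misses[c][w] (misses core c would incur with w ways). The paper's lookahead algorithm also handles non-convex curves; this greedy loop is a simplification:

    #define CORES 2
    #define WAYS  16

    /* Hand out ways one at a time, each to the core whose miss
     * count drops the most from receiving one more way. */
    void partition(const long misses[CORES][WAYS + 1], int alloc[CORES]) {
        for (int c = 0; c < CORES; c++) alloc[c] = 0;
        for (int w = 0; w < WAYS; w++) {
            int best = 0;
            long best_gain = -1;
            for (int c = 0; c < CORES; c++) {
                long gain = misses[c][alloc[c]] - misses[c][alloc[c] + 1];
                if (gain > best_gain) { best_gain = gain; best = c; }
            }
            alloc[best]++;
        }
    }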

SLIDE 17

Highly Associative Caches

- Last-level caches have ~32 ways in multicores
  - Increased energy, latency, and area overheads [Sanchez’10]

SLIDE 18

Recall: Victim Caches

- Goal: decrease conflict misses using a small fully associative (FA) cache

(Figure: a 4-way set-associative last-level cache backed by a small FA victim cache.)

Can we reduce the hardware overheads?

SLIDE 19

The ZCache

- Goal: design a highly associative cache with a low number of ways
- Improves associativity by increasing the number of replacement candidates
- Retains the low energy per hit, latency, and area of caches with few ways
- Skewed-associative cache: each way has a different indexing function (in essence, W direct-mapped caches)

[Sanchez’10]

SLIDE 20

The ZCache

- When block A is brought in, it could replace one of four (say) blocks B, C, D, E; but B could be made to reside in one of three other locations (currently occupied by F, G, H); and F could in turn be moved to one of three other locations

[Sanchez’10] Read the paper for more details.
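A minimal sketch of per-way skewed indexing in C, the building block behind this relocation tree; the mixing function is an illustrative stand-in, not the paper's hash family:

    #include <stdint.h>

    #define WAYS     4
    #define SET_MASK ((1u << 10) - 1)     /* 1024 sets per way */

    /* Each way hashes the block address differently, so blocks
     * that conflict in one way map to unrelated locations in the
     * others; those locations are exactly the extra replacement
     * candidates walked when block A arrives. */
    uint32_t way_index(uint64_t addr, int way) {
        uint64_t x = addr ^ ((uint64_t)(way + 1) * 0x9e3779b97f4a7c15ull);
        x ^= x >> 23;
        x *= 0x2545f4914f6cdd1dull;
        return (uint32_t)(x >> 32) & SET_MASK;
    }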

SLIDE 21

Content Aware Optimizations

SLIDE 22

Dynamic Zero Compression

- More than 70% of the bits in data cache accesses are 0s

(Figure: example of a small cache: an address decoder drives global and local wordlines; offset decoders select within the data SRAM arrays, which feed sense amps; a tag SRAM array feeds the tag comparator.) [Villa’00]

SLIDE 23

Dynamic Zero Compression

- Zero Indicator Bit (ZIB): one bit per grouping of bits; set if the group's bits are all zeros; controls wordline gating

(Figure: alongside the address-controlled local wordline path, the ZIB adds a data-controlled gate, so all-zero groups skip the SRAM cells and sense amps.) [Villa’00]
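A minimal software sketch of the ZIB bookkeeping in C, assuming an illustrative one indicator bit per byte of a 32-byte line; in hardware the indicator gates the local wordline rather than taking a branch:

    #include <stdint.h>

    #define LINE_BYTES 32

    typedef struct {
        uint8_t  data[LINE_BYTES];
        uint32_t zib;             /* bit i set => byte i is all zeros */
    } dzc_line;

    /* On a write: record which byte groups are zero. */
    void dzc_write(dzc_line *l, const uint8_t *src) {
        l->zib = 0;
        for (int i = 0; i < LINE_BYTES; i++) {
            l->data[i] = src[i];
            if (src[i] == 0)
                l->zib |= 1u << i;
        }
    }

    /* On a read: a byte flagged in the ZIB never touches the data
     * array (its wordline stays gated), saving bitline energy. */
    uint8_t dzc_read(const dzc_line *l, int i) {
        return (l->zib & (1u << i)) ? 0 : l->data[i];
    }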

SLIDE 24

Dynamic Zero Compression

- Data cache bitline swing reduction

(Figure: percent bitline swing reduction for comp, li, ijpeg, go, vortex, m88k, gcc, perl, adpcm_en, adpcm_de, epic, unepic, g721_en, g721_de, mpeg_en, mpeg_de, pegwit_en, pegwit_de, and the average, at word, half-word, byte, and half-byte ZIB granularities.) [Villa’00]

SLIDE 25

Dynamic Zero Compression

- Data cache energy savings

(Figure: percent data cache energy savings across the same benchmarks: comp, li, ijpeg, go, vortex, m88k, gcc, perl, adpcm_en, adpcm_de, epic, unepic, g721_en, g721_de, mpeg_en, mpeg_de, pegwit_en, pegwit_de, and the average.) [Villa’00]