SLIDE 1

Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers

Norm Jouppi

SLIDE 2

In Context

  • At the time, CPU performance was really beginning to pull away from DRAM performance
  • Increased interest in memory system performance
  • Mark Hill had just introduced the 4-Cs as a way of categorizing cache misses:
      • Conflict
      • Compulsory
      • Capacity
      • Coherence
  • Single-chip processors were really coming into their own
      • Increased pressure on area, so direct-mapped caches were very desirable.
  • It was also before the “Quantitative Approach”

SLIDE 3

Goals

  • Increase effectiveness of direct-mapped caches without spending much area.
  • In the retrospective, Norm says he was looking at each class of miss individually.

SLIDE 4

Motivation

SLIDE 5

Idea 1: Miss Buffers

  • When you fill a line, store a second copy in the miss buffer.
  • If you need the data again, it’ll be close at hand (see the sketch below).
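A minimal Python sketch of this fill policy, assuming an illustrative 64-set direct-mapped L1 with one line per set and a four-entry LRU miss buffer (the names and sizes are mine, not from the paper or the slides):

```python
from collections import OrderedDict

L1_SETS = 64                  # illustrative direct-mapped L1, one line per set
MISS_CACHE_ENTRIES = 4        # small fully-associative buffer, LRU replacement

l1 = [None] * L1_SETS         # set index -> resident line address
miss_cache = OrderedDict()    # line address -> True, kept in LRU order

def access(line_addr):
    idx = line_addr % L1_SETS
    if l1[idx] == line_addr:
        return "hit"
    if line_addr in miss_cache:
        miss_cache.move_to_end(line_addr)   # refresh LRU position
        l1[idx] = line_addr                 # restore the line into L1
        return "miss-buffer hit"
    # Miss: fetch the line into L1 and keep a second copy in the
    # miss buffer in case a conflicting address evicts it soon.
    l1[idx] = line_addr
    miss_cache[line_addr] = True
    if len(miss_cache) > MISS_CACHE_ENTRIES:
        miss_cache.popitem(last=False)      # drop the LRU entry
    return "miss"
```

With two addresses that map to the same set, each misses once and then every subsequent access hits in the miss buffer, which is exactly the conflict-miss pattern the buffer targets.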

SLIDE 6

Miss Buffer Performance

SLIDE 7

Miss Buffer Gains

  • The miss buffer only addresses conflict misses
  • So it does better when there are lots of them

SLIDE 8

Problems w/ the Miss Buffer

  • It wastes space
      • Its contents are always replicated in the cache
      • It needs at least two entries to have any benefit.
  • If the conflicting area is larger than the miss buffer, the miss buffer is of no use.
      • We should be able to get some benefit from it, since it is extra space
  • The miss buffer is sort of pessimistic. It assumes that we are going to have a conflict on the data.
  • Let’s be optimistic.

SLIDE 9

The Victim Cache

  • Similar to the miss cache, but only put data in the victim cache when there’s actually a miss (see the sketch below).
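A matching Python sketch, with the same illustrative sizes as before. The behavioral difference from the miss buffer is that only evicted victims enter the buffer, and a victim hit swaps the line back into L1:

```python
from collections import OrderedDict

L1_SETS = 64                  # illustrative direct-mapped L1, one line per set
VICTIM_ENTRIES = 4            # small fully-associative victim cache, LRU

l1 = [None] * L1_SETS         # set index -> resident line address
victims = OrderedDict()       # evicted line address -> True, LRU order

def access(line_addr):
    idx = line_addr % L1_SETS
    if l1[idx] == line_addr:
        return "hit"
    if line_addr in victims:
        # Victim hit: swap. The line moves back into L1 and the line
        # it displaces becomes the newest victim, so nothing is ever
        # duplicated between L1 and the victim cache.
        victims.pop(line_addr)
        victims[l1[idx]] = True
        l1[idx] = line_addr
        return "victim hit"
    # Full miss: fetch from the next level; only the displaced line
    # (the victim) is placed in the victim cache.
    if l1[idx] is not None:
        victims[l1[idx]] = True
        if len(victims) > VICTIM_ENTRIES:
            victims.popitem(last=False)     # drop the LRU victim
    l1[idx] = line_addr
    return "miss"
```

Because a line lives either in L1 or in the victim cache, never both, the same number of entries covers strictly more conflicts than the miss buffer does.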

SLIDE 10

Victim Buffer Gains

SLIDE 11

Interesting Metrics

SLIDE 12

Fractional Associativity

  • Norm mentions the notion of fractional associativity.
  • You can think of a victim buffer as adding additional associativity to just the lines in the cache that need it.
  • Why pay for associativity everywhere when only a few cache lines are problematic?

SLIDE 13

Victim Buffers Today

  • Victim buffers are very popular today, but not as Norm envisioned them.
  • Associativity is not prohibitively expensive.
  • In CMPs, cache inclusion makes less sense:
      • 256KB L2
      • 8 cores × (16KB L1 D + 16KB L1 I) = 256KB of L1
      • Aggregate L1 capacity equals L2 capacity
      • Inclusion is very wasteful -- everything is duplicated
  • Instead, use the L2 as a shared victim buffer
      • Associative, but not fully associative.

SLIDE 14

Address Compulsory and Capacity Misses

  • Fixing compulsory misses is tough: you must predict the future.

  • Previous techniques
  • Larger cache lines
  • Next line prefetcher

SLIDE 15

Simple Prefetching

  • Prefetch always
      • Always bring in the next line on every reference
      • Seems wasteful.
      • He says it’s not tractable, but that only applies to this system (maybe)
  • Prefetch on miss
      • Seems more reasonable.
      • Similar to doubling the cache line size
      • Can reduce misses by half.
  • Prefetch tagged
      • When a prefetched block is actually used, the next line is fetched.
      • Could reduce misses to zero, but waiting for the use is actually too late.
  • We need to get farther ahead in the access stream. That would require more space. (The three policies are sketched below.)
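A minimal Python sketch of the three policies side by side, ignoring capacity and eviction entirely so only the prefetch decision is visible (the set-based model and the `policy` strings are illustrative, not from the paper):

```python
resident = set()              # line addresses currently in the cache
tagged = set()                # prefetched lines that have not been used yet

def prefetch(line):
    if line not in resident:  # only issue a prefetch for absent lines
        resident.add(line)
        tagged.add(line)      # mark it as prefetched-but-unused

def access(line, policy):
    hit = line in resident
    was_tagged = line in tagged
    tagged.discard(line)      # any reference to a line clears its tag
    if not hit:
        resident.add(line)    # demand fetch
    if policy == "always":
        prefetch(line + 1)            # next line on every reference
    elif policy == "on_miss" and not hit:
        prefetch(line + 1)            # like doubling the line size
    elif policy == "tagged" and (not hit or was_tagged):
        prefetch(line + 1)            # first use of a prefetch fetches ahead
    return hit
```

On a purely sequential scan, `tagged` keeps exactly one line of lookahead in flight, which is why waiting for the use comes too late: by the time line n is touched, line n+1 has only just been requested.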

SLIDE 16

Stream Buffers

  • The previous techniques waste cache space, perhaps displacing other useful data.
  • A stream buffer provides dedicated space for the prefetched data.

SLIDE 17

Stream Buffers

  • On a miss, start fetching successive lines
  • When they return, put them in the stream buffer
  • On future misses, check the head of the stream buffer; if it’s a hit, great! Fetch another line.
  • If it’s a miss, clear the stream buffer and start over (see the sketch below).
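A minimal Python sketch of that head-check policy for a single stream buffer (the four-entry depth is illustrative, and fetch latency is ignored, so prefetched lines are available immediately):

```python
from collections import deque

DEPTH = 4                     # illustrative stream-buffer depth
stream = deque()              # FIFO of prefetched line addresses

def miss(line_addr):
    """Called on an L1 miss; returns True if the stream buffer has the line."""
    if stream and stream[0] == line_addr:
        stream.popleft()                    # hit at the head: consume it
        stream.append(line_addr + DEPTH)    # fetch another line to stay ahead
        return True
    # Head miss: clear the buffer and restart the stream here.
    stream.clear()
    stream.extend(line_addr + i for i in range(1, DEPTH + 1))
    return False
```

Only the head entry is checked, which keeps the comparison hardware trivial but makes the buffer useless for anything other than one strictly sequential stream.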

SLIDE 18

Effectiveness

  • Great for instructions
  • OK for data.

SLIDE 19

The Problem with Data

  • Programs often make interleaved, sequential streams of accesses

  • One stream buffer is not enough.
  • There is only one instruction stream, however.

SLIDE 20

Build Multiple Buffers
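The fix for interleaved data streams is to run several stream buffers in parallel and restart the least recently used one on a miss. A minimal Python sketch under the same simplifications as above (four ways of depth four are illustrative choices):

```python
from collections import deque

WAYS, DEPTH = 4, 4            # illustrative: four parallel stream buffers
buffers = [deque() for _ in range(WAYS)]
lru = list(range(WAYS))       # buffer indices, least recently used first

def miss(line_addr):
    """Probe every buffer's head in parallel; restart the LRU one on a miss."""
    for i, buf in enumerate(buffers):
        if buf and buf[0] == line_addr:
            buf.popleft()                   # head hit: consume the line
            buf.append(line_addr + DEPTH)   # keep this stream DEPTH ahead
            lru.remove(i)
            lru.append(i)                   # mark buffer i most recently used
            return True
    victim = lru.pop(0)                     # no head hit: recycle the LRU buffer
    lru.append(victim)
    buffers[victim].clear()
    buffers[victim].extend(line_addr + j for j in range(1, DEPTH + 1))
    return False
```

Two interleaved scans, e.g. lines 100, 200, 101, 201, ..., each capture their own buffer here, where the single stream buffer above would be flushed on every access.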

SLIDE 21

Stream Buffers Today

  • Prefetching is very popular today
  • Prefetchers are very sophisticated, and very hard to reverse engineer and/or out-smart.
  • You need to disable them if you want to measure much of anything about your memory hierarchy.
  • You will design your own prefetcher later in the course.

SLIDE 22

Conclusions

  • Victim buffers and stream buffers are worthwhile
  • They can substantially reduce 3 of the 4 Cs
  • The paper says very little about how they would perform on a particular machine or how they should be provisioned.
  • It is all about trends and the underlying characteristics of the access stream that they exploit.

  • The hardware trade-offs are also important.
