Slides for Lecture 9 (ENCM 501: Principles of Computer Architecture)


SLIDE 1

Slides for Lecture 9

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

6 February, 2014

SLIDE 2

ENCM 501 W14 Slides for Lecture 9

slide 2/23

Previous Lecture

◮ completion of DRAM coverage
◮ introduction to caches

SLIDE 3

Today’s Lecture

◮ continued coverage of cache design and cache performance

Related reading in Hennessy & Patterson: Sections B.1–B.3.

SLIDE 4

Review: Example computer with only one level of cache and no virtual memory

[Diagram: processor core with split L1 I-cache and L1 D-cache, connected through a DRAM controller to DRAM modules.]

We’re looking at this simple system because it helps us to think about cache design and performance issues while avoiding the complexity of real systems like the Intel i7 shown in textbook Figure 2.21.

SLIDE 5

Review: Example data cache organization

[Diagram: a 2-way set-associative cache (way 0 and way 1) with 256 sets, selected by an 8-bit index through an 8-to-256 decoder. Per block: status (1 valid bit and 1 dirty bit), an 18-bit stored tag, and a 64-byte (512-bit) data block. Per set: 1 LRU bit.]

This could fit within our simple example hierarchy, but is also not much different from some current L1 D-cache designs.

SLIDE 6

Review: Completion of a read hit in the example cache

See the previous lecture for details of hit detection. We supposed that the data size requested for a read at address 0x1001abc4 is word. The address split was

tag: 0001 0000 0000 0001 10 | index: 1010 1111 | offset: 00 0100

What updates happen in the cache, and what information goes back to the core?

Answer: In the way where the hit did happen, the index (binary 10101111 = decimal 175), the block offset, and the data size are used to find a 32-bit word within a data block to copy into the core. The LRU bit for set 175 is updated to equal the number of the way where the hit did not happen.

SLIDE 7

How did the stored tags get into the D-cache?

And how did V-bit values go from 0 to 1? These are good questions. The answer to both is that stored tags and valid data blocks get into a cache as a result of misses.

The story of cache access is being told starting not at the beginning, when a program is launched, but in the middle, after a program has been running for a while, and some of the program's data has already been copied from main memory into the D-cache. The reason for telling the story this way is that hits are easier to describe (and, we hope, much more frequent) than misses.

SLIDE 8

Completion of a write hit in the example cache

As with a read, this only happens if the result of hit detection was a hit. To handle writes, this particular cache uses a write-back strategy, in which write hits update the cache, and do not cause updates in the next level of the memory hierarchy. Let’s suppose that the data size requested for a write at address 0x1001abc4 is word. What updates happen in the cache, and what information goes back to the core?

SLIDE 9

Completion of a read miss in the example cache

A read miss means that the data the core was seeking is not in the cache. Again suppose that the data size requested for a read at address 0x1001abc4 is word. What will happen to get the core the data it needs? What updates are needed to make sure that the cache will behave efficiently and correctly in the future? Important: Make sure you know why it is absolutely not good enough to copy only one 32-bit word from DRAM to the cache!

SLIDE 10

Completion of a write miss in the example cache

A write miss means that the memory contents the core wanted to update are not currently reflected in the cache. Again suppose that the data size requested for a write at address 0x1001abc4 is word. What will happen to complete the write? What updates are needed to make sure that the cache will behave efficiently and correctly in the future?

SLIDE 11

Storage cells in the example cache

Data blocks are implemented with SRAM cells. The design tradeoffs for the cell design relate to speed, chip area, and energy use per read or write. Tags and status bits might be SRAM cells or might be CAM (“content addressable memory”) cells. A CMOS CAM cell uses the same 6-transistor structure as an SRAM cell for reads, writes, and holding a stored bit for long periods of time during which there are neither reads nor writes. A CAM cell also has 3 or 4 extra transistors that help in determining whether the bit pattern in a group of CAM cells (e.g., a stored tag) matches some other bit pattern (e.g., a search tag).

SLIDE 12

CAM cells organized to make a J-bit stored tag

[Diagram: a row of J CAM cells sharing wordline WLi and matchline MATCHi, with bitline pairs BL(J−1), BL(J−2), …, BL0.]

For reads or writes, the wordline and bitlines play the same roles they do in a row of an SRAM array. To check for a match, the wordline is held LOW, and the search tag is applied to the bitlines. If every search tag bit matches the corresponding stored tag bit, the matchline stays HIGH; if there is even a single mismatch, the matchline goes LOW.

SLIDE 13

CAM cells versus SRAM cells for stored tags

With CAM cells, tag comparisons can be done in place. With SRAM cells, stored tags would have to be read via bitlines to comparator circuits outside the tag array, which is a slower process. (Schematics of caches in the textbook tend to show tag comparison done outside of tag arrays, but that is likely done to show that a comparison is needed, not to indicate physical design.) CAM cells are larger than SRAM cells. But the total area needed for CAM-cell tag arrays will still be much smaller than the total area needed for SRAM data blocks. Would it make sense to use CAM cells for V (valid) bits?

SLIDE 14

Cache line is a synonym for cache block

Hennessy and Patterson are fairly consistent in their use of the term cache block, and a lot of other literature uses that term as well. However, the term cache line, which means the same thing, is also in wide use. So you will probably read things like . . .

◮ “In a 4-way set-associative cache, an index finds a set containing 4 cache lines.”
◮ “In a direct-mapped cache, there is one cache line per index.”
◮ “A cache miss, even if it is for access to a single byte, will result in the transfer of an entire cache line.”

SLIDE 15

Direct-mapped caches

A direct-mapped cache can be thought of as a special case of a set-associative cache, in which there is only one way. For a given cache capacity, a direct-mapped cache is easier to build than an N-way set-associative cache with N ≥ 2:

◮ no logic is required to find the correct way for data transfer after a hit is detected;
◮ no logic is needed to decide which block in a set to replace in handling a miss.

Direct-mapped caches may also be faster and more energy-efficient. However, direct-mapped caches are vulnerable to index conflicts (sometimes called index collisions).

SLIDE 16

Data cache index conflict example

Consider this sketch of a C function:

    int g1, g2, g3, g4;

    void func(int *x, int n)
    {
        int loc[10], k;
        while ( condition ) {
            make accesses to g1, g2, g3, g4 and loc
        }
    }

What will happen in the following scenario?

◮ the addresses of g1 to g4 are 0x0804fff0 to 0x0804fffc
◮ the address of loc[0] is 0xbfbdfff0

SLIDE 17

Instruction cache index conflict example

Suppose a program spends much of its time in a loop within function f . . .

    void f(double *x, double *y, int n)
    {
        int k;
        for (k = 0; k < n; k++)
            y[k] = g(x[k]) + h(x[k]);
    }

Suppose that g and h are small, simple functions that don’t call other functions. What kind of bad luck could cause huge numbers of misses in a direct-mapped instruction cache?

SLIDE 18

Motivation for set-associative caches (1)

Qualitatively:

◮ In our example of data cache index conflicts, the conflicts go away if the cache is changed from direct-mapped to 2-way set-associative.
◮ In our example of instruction cache index conflicts, in the worst case, the conflicts go away if the cache is changed from direct-mapped to 4-way set-associative.

Quantitatively, see Figure B.8 on page B-24 of the textbook.

◮ Conflict misses are a big problem in direct-mapped caches.
◮ Moving from direct-mapped to 2-way to 4-way to 8-way reduces the conflict miss rate at each step.

SLIDE 19

Motivation for set-associative caches (2)

Detailed studies show that 4-way set-associativity is good enough to eliminate almost all conflict misses. But many practical cache designs are 8- or even 16-way set-associative. There must be reasons for this other than the desire to avoid conflict misses. We’ll come back to this question later.

SLIDE 20

Replacement strategies in set-associative caches

Let N be the number of ways. With N = 2, LRU replacement is easy to implement: a single bit in each set can track which block should be replaced on a miss in that set. Exact LRU replacement is harder to implement with N > 2: LRU status bits would have to somehow encode a list of least-to-most-recent accesses within a set.

However, choice of replacement strategy between various reasonable options seems to have very little effect on miss rate (see Figure B.4 on page B-10 of the textbook). So we’re not going to study cache block replacement strategy in detail in ENCM 501.

SLIDE 21

Fully-associative caches

A fully-associative cache can be thought of as an N-way set-associative cache in which N is equal to the number of blocks. In this way of thinking, how many sets are there in a fully-associative cache? What is the width of an index? For energy use, how would hit detection in a fully-associative cache compare with hit detection in a direct-mapped or (small N) set-associative cache with the same capacity?

SLIDE 22

Different options for handling writes

See Q4: What Happens on a Write? on pages B-10 to B-12 of the textbook for details. There is not much I can put into lecture slides to improve on the clarity of that material. Note that the write hit and write miss examples given early in this lecture assume a write-back policy, using write allocate in the case of a write miss.

SLIDE 23

Next week

Quiz #1: 12:30 to 1:10pm on Tuesday. It’s closed-book; you will be given any equations you need for particular problems. Calculators are allowed: Casio FX-260, Casio FX-300MS, or TI-30XIIS. The quiz will be followed by a 5-minute break and 30 minutes of lecture.

Lecture material for Tuesday and Thursday:

◮ Completion of material on cache design and cache performance.
◮ Virtual memory.

Related reading in Hennessy & Patterson: Sections B.2–B.5.