Slides for Lecture 9 (ENCM 501: Principles of Computer Architecture)


SLIDE 1

Slides for Lecture 9

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

6 February, 2014

SLIDE 2

ENCM 501 W14 Slides for Lecture 9

slide 2/23

Previous Lecture

◮ completion of DRAM coverage
◮ introduction to caches

SLIDE 3

Today’s Lecture

◮ continued coverage of cache design and cache performance

Related reading in Hennessy & Patterson: Sections B.1–B.3.

SLIDE 4

Review: Example computer with only one level of cache and no virtual memory

[Diagram: processor core with split L1 I-cache and L1 D-cache, connected through a DRAM controller to DRAM modules.]

We’re looking at this simple system because it helps us to think about cache design and performance issues while avoiding the complexity of real systems like the Intel i7 shown in textbook Figure 2.21.

SLIDE 5

Review: Example data cache organization

[Diagram: a 2-way set-associative cache (way 0 and way 1) with 256 sets, selected by an 8-bit index through an 8-to-256 decoder. Per block: status (1 valid bit and 1 dirty bit), an 18-bit stored tag, and a 64-byte (512-bit) data block. Per set: 1 LRU bit.]

This could fit within our simple example hierarchy, but is also not much different from some current L1 D-cache designs.

SLIDE 6

Review: Completion of a read hit in the example cache

See the previous lecture for details of hit detection. We supposed that the data size requested for a read at address 0x1001abc4 is word. The address split was

tag: 0001 0000 0000 0001 10 | index: 1010 1111 | offset: 00 0100

What updates happen in the cache, and what information goes back to the core?

Answer: In the way where the hit did happen, the index (binary 10101111 = decimal 175), the block offset, and the data size are used to find a 32-bit word within a data block to copy into the core. The LRU bit for set 175 is updated to equal the number of the way where the hit did not happen.

SLIDE 7

How did the stored tags get into the D-cache?

And how did V-bit values go from 0 to 1? These are good questions. The answer to both is that stored tags and valid data blocks get into a cache as a result of misses.

The story of cache access is being told starting not at the beginning, when a program is launched, but in the middle, after a program has been running for a while, and some of the program's data has already been copied from main memory into the D-cache. The reason for telling the story this way is that hits are easier to describe (and, we hope, much more frequent) than misses.

SLIDE 8

Completion of a write hit in the example cache

As with a read, this only happens if the result of hit detection was a hit. To handle writes, this particular cache uses a write-back strategy, in which write hits update the cache, and do not cause updates in the next level of the memory hierarchy. Let’s suppose that the data size requested for a write at address 0x1001abc4 is word. What updates happen in the cache, and what information goes back to the core?

SLIDE 9

Completion of a read miss in the example cache

A read miss means that the data the core was seeking is not in the cache. Again suppose that the data size requested for a read at address 0x1001abc4 is word. What will happen to get the core the data it needs? What updates are needed to make sure that the cache will behave efficiently and correctly in the future? Important: Make sure you know why it is absolutely not good enough to copy only one 32-bit word from DRAM to the cache!

SLIDE 10

Completion of a write miss in the example cache

A write miss means that the memory contents the core wanted to update are not currently reflected in the cache. Again suppose that the data size requested for a write at address 0x1001abc4 is word. What will happen to complete the write? What updates are needed to make sure that the cache will behave efficiently and correctly in the future?

SLIDE 11

Storage cells in the example cache

Data blocks are implemented with SRAM cells. The design tradeoffs for the cell design relate to speed, chip area, and energy use per read or write. Tags and status bits might be SRAM cells or might be CAM (“content addressable memory”) cells. A CMOS CAM cell uses the same 6-transistor structure as an SRAM cell for reads, writes, and holding a stored bit for long periods of time during which there are neither reads nor writes. A CAM cell also has 3 or 4 extra transistors that help in determining whether the bit pattern in a group of CAM cells (e.g., a stored tag) matches some other bit pattern (e.g., a search tag).

SLIDE 12

CAM cells organized to make a J-bit stored tag

[Diagram: a row of J CAM cells sharing wordline WLi and matchline MATCHi, with bitline pairs BL(J−1), BL(J−2), …, BL0.]

For reads or writes, the wordline and bitlines play the same roles they do in a row of an SRAM array. To check for a match, the wordline is held LOW, and the search tag is applied to the bitlines. If every search tag bit matches the corresponding stored tag bit, the matchline stays HIGH; if there is even a single mismatch, the matchline goes LOW.

SLIDE 13

CAM cells versus SRAM cells for stored tags

With CAM cells, tag comparisons can be done in place. With SRAM cells, stored tags would have to be read via bitlines to comparator circuits outside the tag array, which is a slower process. (Schematics of caches in the textbook tend to show tag comparison done outside of tag arrays, but that is likely done to show that a comparison is needed, not to indicate physical design.) CAM cells are larger than SRAM cells. But the total area needed for CAM-cell tag arrays will still be much smaller than the total area needed for SRAM data blocks. Would it make sense to use CAM cells for V (valid) bits?

SLIDE 14

Cache line is a synonym for cache block

Hennessy and Patterson are fairly consistent in their use of the term cache block, and a lot of other literature uses that term as well. However, the term cache line, which means the same thing, is also in wide use. So you will probably read things like . . .

◮ “In a 4-way set-associative cache, an index finds a set containing 4 cache lines.”
◮ “In a direct-mapped cache, there is one cache line per index.”
◮ “A cache miss, even if it is for access to a single byte, will result in the transfer of an entire cache line.”

SLIDE 15

Direct-mapped caches

A direct-mapped cache can be thought of as a special case of a set-associative cache, in which there is only one way. For a given cache capacity, a direct-mapped cache is easier to build than an N-way set-associative cache with N ≥ 2:

◮ no logic is required to find the correct way for data transfer after a hit is detected;
◮ no logic is needed to decide which block in a set to replace in handling a miss.

Direct-mapped caches may also be faster and more energy-efficient. However, direct-mapped caches are vulnerable to index conflicts (sometimes called index collisions).

SLIDE 16

Data cache index conflict example

Consider this sketch of a C function:

    int g1, g2, g3, g4;

    void func(int *x, int n)
    {
        int loc[10], k;
        while ( condition ) {
            make accesses to g1, g2, g3, g4 and loc
        }
    }

What will happen in the following scenario?

◮ the addresses of g1 to g4 are 0x0804fff0 to 0x0804fffc
◮ the address of loc[0] is 0xbfbdfff0

SLIDE 17

Instruction cache index conflict example

Suppose a program spends much of its time in a loop within function f . . .

    void f(double *x, double *y, int n)
    {
        int k;
        for (k = 0; k < n; k++)
            y[k] = g(x[k]) + h(x[k]);
    }

Suppose that g and h are small, simple functions that don’t call other functions. What kind of bad luck could cause huge numbers of misses in a direct-mapped instruction cache?

SLIDE 18

Motivation for set-associative caches (1)

Qualitatively:

◮ In our example of data cache index conflicts, the conflicts go away if the cache is changed from direct-mapped to 2-way set-associative.
◮ In our example of instruction cache index conflicts, in the worst case, the conflicts go away if the cache is changed from direct-mapped to 4-way set-associative.

Quantitatively, see Figure B.8 on page B-24 of the textbook.

◮ Conflict misses are a big problem in direct-mapped caches.
◮ Moving from direct-mapped to 2-way to 4-way to 8-way reduces the conflict miss rate at each step.

SLIDE 19

Motivation for set-associative caches (2)

Detailed studies show that 4-way set-associativity is good enough to eliminate almost all conflict misses. But many practical cache designs are 8- or even 16-way set-associative. There must be reasons for this other than the desire to avoid conflict misses. We’ll come back to this question later.

SLIDE 20

Replacement strategies in set-associative caches

Let N be the number of ways. With N = 2, LRU replacement is easy to implement: a single bit in each set can track which block should be replaced on a miss in that set. Exact LRU replacement is harder to implement with N > 2: LRU status bits would have to somehow encode a list of least-to-most-recent accesses within a set.

However, choice of replacement strategy between various reasonable options seems to have very little effect on miss rate (see Figure B.4 on page B-10 of the textbook). So we’re not going to study cache block replacement strategy in detail in ENCM 501.

SLIDE 21

Fully-associative caches

A fully-associative cache can be thought of as an N-way set-associative cache in which N is equal to the number of blocks. In this way of thinking, how many sets are there in a fully-associative cache? What is the width of an index? For energy use, how would hit detection in a fully-associative cache compare with hit detection in a direct-mapped or (small N) set-associative cache with the same capacity?

SLIDE 22

Different options for handling writes

See Q4: What Happens on a Write? on pages B-10 to B-12 of the textbook for details. There is not much I can put into lecture slides to improve on the clarity of that material. Note that the write hit and write miss examples given early in this lecture assume a write-back policy, using write allocate in the case of a write miss.

SLIDE 23

Next week

Quiz #1: 12:30 to 1:10pm on Tuesday. It’s closed-book; you will be given any equations you need for particular problems. Calculators are allowed: Casio FX-260, Casio FX-300MS, or TI-30XIIS. The quiz will be followed by a 5-minute break and 30 minutes of lecture.

Lecture material for Tuesday and Thursday:

◮ Completion of material on cache design and cache performance.
◮ Virtual memory.

Related reading in Hennessy & Patterson: Sections B.2–B.5.