SLIDE 1 Slides for Lecture 8
ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng
Electrical & Computer Engineering Schulich School of Engineering University of Calgary
4 February, 2014
SLIDE 2
Previous Lecture
◮ conditional branches in various ISAs
◮ introduction to memory systems
◮ review of SRAM and DRAM
SLIDE 3
Today’s Lecture
◮ more about DRAM
◮ introduction to caches
Related reading in Hennessy & Patterson: Sections B.1–B.2
SLIDE 4
The “1T” DRAM (Dynamic RAM) cell
[Schematic: a 1T DRAM cell. WORDLINE gates an access transistor that connects the storage capacitor node Q to BITLINE.]
The bit is stored as a voltage on a capacitor. A relatively high voltage at Q is a 1, and a relatively low voltage at Q is a 0. When the stored bit is a 1, charge is slowly leaking from node Q to ground. In a DRAM array, each row of cells must periodically be read and written back to strengthen the voltages in cells with stored 1’s—this is called refresh. DRAM gets the name dynamic from the continuing activity needed to keep the stored data valid.
SLIDE 5
Writing to a DRAM cell
[Schematic: the same 1T DRAM cell, with BITLINE driven to a logic level and WORDLINE turned on to write node Q.]
Set BITLINE to the appropriate voltage for a 1 or a 0. Turn on WORDLINE. Q will take on the appropriate voltage.
SLIDE 6
Reading from a DRAM cell
[Schematic: the same 1T DRAM cell, with BITLINE pre-charged and WORDLINE turned on to read node Q.]
Pre-charge BITLINE and some nearby electrically similar reference wire to the same voltage, somewhere between logic 0 and logic 1. Turn on WORDLINE. The cell will create a voltage difference between BITLINE and the reference wire, such that the difference can be reliably measured by a sense amplifier. Reading a DRAM cell destroys the data in the cell. After a read, the data must be written back.
SLIDE 7
A 4 × 4 DRAM array
A circuit schematic is shown on the next slide. There is no good commercial reason to build such a tiny DRAM array, but nevertheless the schematic can be used to partially explain how DRAM works. In a read operation, half of the bitlines get used to capture bit values from DRAM cells, and the other half are used as reference wires. This technique is called folded bitlines. The schematic does not show the physical layout of folded bitlines. The block labeled [THIS IS COMPLICATED!] has a lot to do! In there we need bitline drivers, sense amplifiers, refresh logic, and more . . .
SLIDE 8
[Schematic: a 4 × 4 DRAM array. An address decoder driven by inputs A0 and A1 selects one of wordlines WL0 through WL3. Sixteen DRAM cells sit on bitlines BL0 through BL3 and their folded complements, all connected to the block labeled [THIS IS COMPLICATED!], which exchanges data bits D3 through D0 under its CTRL inputs.]
SLIDE 9
DRAM arrays have long latencies compared to SRAM arrays. Why?
1. DRAM arrays typically have much larger capacities than SRAM arrays, so the ratio of cell dimensions to bitline length is much worse for DRAM arrays.
2. A passive capacitor (DRAM) is less effective at changing bitline voltages than an active pair of inverters (SRAM) is.
3. Today, SRAM circuits are usually on the same chip as processor cores, while DRAMs are off-chip, connected to processor chips by wires that may be as long as tens of millimeters.
4. DRAM circuits have to dedicate some time to refresh, but SRAM circuits don't.
SLIDE 10
A 4 GB DRAM SO-DIMM
(Image source: Wikipedia—see http://en.wikipedia.org/wiki/File:4GB_DDR3_SO-DIMM.jpg for details.)
SLIDE 11
A 4 GB DRAM SO-DIMM, continued
SO-DIMM: small outline dual inline memory module. The module in the image appears to have eight 4 Gb DRAM chips (but might have sixteen 2 Gb DRAM chips, with eight on each side of the module). 64 of the 204 connectors are for data. The rest are for address bits, control, power and ground, etc. The module shown can receive or send data at a rate of up to 10667 MB/s—64-bit transfers at a rate of 1333 million transfers per second. How long would it take two such modules—working in parallel—to transfer 64 bytes to a DRAM controller?
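A back-of-the-envelope check of that question, as a small C program. The 1333 million transfers per second figure is from the slide; treating two modules working in parallel as one combined 16-byte-wide channel is an assumption made for this sketch.

#include <stdio.h>

int main(void) {
    const double transfers_per_sec = 1333e6;  /* per the slide: 1333 MT/s */
    const int bytes_per_transfer = 16;        /* assumed: 2 modules x 8-byte bus */
    const int total_bytes = 64;

    int transfers = total_bytes / bytes_per_transfer;  /* 64 / 16 = 4 */
    double ns = transfers / transfers_per_sec * 1e9;   /* about 3 ns */
    printf("%d transfers, about %.1f ns\n", transfers, ns);
    return 0;
}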
SLIDE 12
Why is DRAM bandwidth good when latency is so bad? A partial answer . . .
The internal arrangement of a typical 4 Gb DRAM chip might be four DRAM arrays—called banks—of 1 Gb each. The dimensions of a bank would then be 2^15 rows × 2^15 columns. So access to a single row accesses 32 Kb of data! It pays for the DRAM controller to do writes and reads of chunks of data much larger than 4 or 8 bytes. But because the data bus width of a DIMM is only 8 bytes, these big transfers have to be serialized into multi-transfer bursts.
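To illustrate the burst arithmetic, here is a minimal C sketch, assuming the hypothetical 4 Gb chip organization described above and the 8-byte DIMM data bus:

#include <stdio.h>

int main(void) {
    const long row_bits = 1L << 15;  /* 2^15 columns: one row holds 32 Kb */
    const long bus_bytes = 8;        /* 64-bit DIMM data bus */

    long row_bytes = row_bits / 8;   /* 4096 bytes per row */
    printf("one row:       %ld bytes -> %ld bus transfers\n",
           row_bytes, row_bytes / bus_bytes);      /* 512 transfers */
    printf("64-byte chunk: %ld bus transfers\n",
           64 / bus_bytes);                        /* a burst of 8 */
    return 0;
}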
SLIDE 13
Quantifying Cache Performance
Rather than starting with a multi-core system with multiple levels of caches and complex interactions between caches and TLBs, let’s start with a simple system:
◮ one core
◮ no virtual memory
◮ only one level of caches
◮ constant processor clock frequency
This is shown on the next slide . . .
SLIDE 14
No VM, only one level of cache . . .
[Block diagram: the CORE connects to an L1 I-CACHE and an L1 D-CACHE; both caches connect through a DRAM CONTROLLER to the DRAM MODULES.]
We’ll measure time in processor clock cycles.
SLIDE 15
Hits and misses
The purpose of a cache memory is to provide a fast mirror of a small portion of the contents of a much larger memory one level farther away from the processor core. (In our simple system, the next level is just DRAM.) A cache hit occurs when a memory access can be handled by a cache without any delay waiting for help from the next level.
A cache miss, then, is a memory access that is not a hit. L1 I-caches and D-caches are generally designed to keep the core running at full speed, in the ideal, happy, but unlikely circumstance that all memory accesses hit in these caches.
SLIDE 16
Miss rates, miss penalties, memory access time (1)
Here is the definition of miss rate:

miss rate = (number of misses) / (total number of cache accesses)

Miss rate is program-dependent and also depends on the design of a cache. What kinds of programs have low miss rates? What aspects of cache design lead to low miss rates?

The miss penalty is defined as the average number of clock cycles a processor must stall in response to a miss. Even for the simple system we're considering, it's an average, not a constant property of the cache and DRAM hardware. Let's write down some reasons why the length of a stall might vary from one miss to the next.
SLIDE 17
Miss rates, miss penalties, memory access time (2)
Hit time can be defined as the length of time needed to complete a memory access in the case of a cache hit. It is likely to be 1 processor clock cycle in the case of an L1 cache.

We can now define average memory access time (AMAT) as

AMAT = hit time + miss rate × miss penalty

Suppose hit time is 1 cycle and miss penalty is 100 cycles. What is AMAT if the miss rate is 0? 1%? 5%? 50%?

We'll return to this kind of analysis to quantify the overall impact of miss rates and miss penalties on program running times, but first we'll look qualitatively at design options for caches.
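A tiny C program to check those AMAT questions, using the hit time and miss penalty values given above:

#include <stdio.h>

int main(void) {
    const double hit_time = 1.0;        /* cycles */
    const double miss_penalty = 100.0;  /* cycles */
    const double miss_rates[] = { 0.0, 0.01, 0.05, 0.50 };

    for (int i = 0; i < 4; i++) {
        double amat = hit_time + miss_rates[i] * miss_penalty;
        printf("miss rate %4.1f%% -> AMAT = %5.1f cycles\n",
               miss_rates[i] * 100.0, amat);
    }
    return 0;  /* prints 1.0, 2.0, 6.0, and 51.0 cycles */
}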
SLIDE 18
We’ll start with a very specific cache design
Most textbooks on computer architecture discuss cache design options in one of two ways:

◮ lengthy exploration of most of the available options, followed by some specific examples of cache designs—this is what is done in Section B.1 of our textbook;
◮ presentation of structures that are too simple to work very well, followed by presentation of more complex structures that perform better.

Instead of either of those approaches, let's start with a structure that would be fairly effective for an L1 cache in 2014, then consider the costs and benefits of changing that structure.
SLIDE 19
A 2-way set-associative 32 KB data cache
Address width is 32 bits. Reads and writes are supported for the following data sizes: byte, 16-bit halfword, 32-bit word, 64-bit doubleword.

[Diagram: 256 sets (set 0 through set 255), each with one block in way 0 and one block in way 1; an 8-to-256 decoder uses the 8-bit index to select one set.]

Key:
◮ block status: 1 valid bit and 1 dirty bit per block
◮ tag: one 18-bit stored tag per block
◮ data: one 64-byte (512-bit) data block
◮ set status: 1 LRU bit per set
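As a concrete, purely illustrative picture of that organization, here is a minimal C sketch of the bookkeeping in this cache; the struct and field names are invented for illustration:

#include <stdint.h>

#define NUM_SETS   256
#define NUM_WAYS   2
#define BLOCK_SIZE 64              /* bytes per data block */

struct block {
    unsigned valid : 1;            /* 1 if the block holds live data */
    unsigned dirty : 1;            /* 1 if the block differs from DRAM */
    uint32_t tag;                  /* 18 significant bits */
    uint8_t  data[BLOCK_SIZE];
};

struct set {
    unsigned lru : 1;              /* which way was least recently used */
    struct block way[NUM_WAYS];
};

struct cache {
    struct set sets[NUM_SETS];     /* selected by the 8-bit index */
};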
SLIDE 20
Cache capacity
How is cache capacity defined? Why exactly is the capacity of our example data cache 32 KB?
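One conventional answer, sketched as arithmetic (capacity normally counts only the data blocks, not tags or status bits):

capacity = 2 ways × 256 sets × 64 bytes/block = 32768 bytes = 32 KB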
SLIDE 21
Wires between the core and the example cache
Let’s look at all of the kinds of communication that can happen between the core and the cache. What unidirectional wires must there be to communicate information from the core to the cache? What unidirectional wires must there be to communicate information from the cache to the core? What kind of wires must be bidirectional?
SLIDE 22
Hit detection in the example data cache (1)
Detecting a hit is done the same way for reads and writes. The memory address is split into pieces as follows:
bits 31–14: search tag (18 bits)
bits 13–6: index (8 bits)
bits 5–0: block offset (6 bits)

For example, the address 0x1001abc4 would be split into search tag, index, and block offset as

0001 0000 0000 0001 10 | 10 1011 11 | 00 0100

(ENCM 369 split the piece called block offset into two parts, a block offset and a byte offset. It's actually simpler to work with only a block offset, as given in the example above.)
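The same split can be expressed with shifts and masks. Here is a small C sketch using the field widths above; the variable names are invented for illustration:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64-byte blocks */
#define INDEX_BITS  8   /* 256 sets */

int main(void) {
    uint32_t addr   = 0x1001abc4u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%05x, index = %u, offset = %u\n", tag, index, offset);
    return 0;  /* prints tag = 0x04006, index = 175, offset = 4 */
}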
SLIDE 23
Hit detection in the example data cache (2)
For our example address, the search tag, index, and block offset are

0001 0000 0000 0001 10 | 10 1011 11 | 00 0100

The index 1010 1111 (binary) = 175 (decimal) selects set 175; all the other sets will be ignored. The search tag will be compared against both stored tags in set 175—if one of them matches the search tag and has a valid bit of 1, the cache access is a hit. If not, it's a miss.
SLIDE 24
Completion of a read hit in the example cache
This only happens if the result of hit detection was a hit. Let's suppose that the data size requested for a read at address 0x1001abc4 is word. What updates happen in the cache, and what information goes back to the core?
SLIDE 25
Completion of a write hit in the example cache
As with a read, this only happens if the result of hit detection was a hit. To handle writes, this particular cache uses a write-back strategy, in which write hits update the cache and do not cause updates in the next level of the memory hierarchy. Let's suppose that the data size requested for a write at address 0x1001abc4 is word. What updates happen in the cache, and what information goes back to the core?
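One plausible sketch of the write-back bookkeeping in C, reusing the struct cache and the field-extraction constants from the earlier sketches; this illustrates the idea, not the actual hardware logic:

#include <stdint.h>
#include <string.h>

/* Assumes struct cache, struct set, struct block, NUM_WAYS, and the
   OFFSET_BITS/INDEX_BITS widths defined in the earlier sketches. */
int write_word_hit(struct cache *c, uint32_t addr, uint32_t value) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    struct set *s = &c->sets[index];
    for (int w = 0; w < NUM_WAYS; w++) {
        struct block *b = &s->way[w];
        if (b->valid && b->tag == tag) {
            memcpy(&b->data[offset], &value, sizeof value); /* cache only */
            b->dirty = 1;      /* DRAM is now stale until eviction */
            s->lru = 1 - w;    /* mark the other way least recently used */
            return 1;          /* write hit completed */
        }
    }
    return 0;  /* miss: handled by the miss path, not shown here */
}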
SLIDE 26
Completion of a read miss in the example cache
A read miss means that the data the core was seeking is not in the cache. Again suppose that the data size requested for a read at address 0x1001abc4 is word. What will happen to get the core the data it needs? What updates are needed to make sure that the cache will behave efficiently and correctly in the future? Important: Make sure you know why it is absolutely not good enough to copy only one 32-bit word from DRAM to the cache!
SLIDE 27
Completion of a write miss in the example cache
A write miss means that the memory contents the core wanted to update are not currently reflected in the cache. Again suppose that the data size requested for a write at address 0x1001abc4 is word. What will happen to complete the write? What updates are needed to make sure that the cache will behave efficiently and correctly in the future?
SLIDE 28
Storage cells in the example cache
Data blocks are implemented with SRAM cells. The design tradeoffs for the cell design relate to speed, chip area, and energy use per read or write. Tags and status bits might be SRAM cells or might be CAM (“content addressable memory”) cells. We’ll look at that choice in the next lecture.
SLIDE 29
Upcoming Topics
◮ More about cache design and cache performance.
Related reading in Hennessy & Patterson: Sections B.1–B.3