SLIDE 1 Slides for Lecture 11
ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng
Electrical & Computer Engineering Schulich School of Engineering University of Calgary
13 February, 2014
SLIDE 2 ENCM 501 W14 Slides for Lecture 11
slide 2/25
Previous Lecture
12:30 to 1:10pm: Quiz #1
1:15 to 1:45pm . . .
◮ fully-associative caches
◮ cache options for handling writes
◮ write buffers
◮ multi-level caches
SLIDE 3
Today’s Lecture
◮ more about multi-level caches
◮ classifying cache misses: the 3 C’s
◮ introduction to virtual memory
Related reading in Hennessy & Patterson: Sections B.2–B.4
SLIDE 4
AMAT in a two-level cache system
Textbook formula:

AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)

It was pointed out (I think) in the previous lecture that the L1 hit time should be weighted by the L1 hit rate. What reasonable assumption would imply that such a weighting would be INCORRECT?

This definition (not from the textbook!) is incorrect: For a system with two levels of caches, the L2 hit rate of a program is the number of L2 hits divided by the total number of memory accesses.

What is a correct definition for L2 hit rate, compatible with the formula for AMAT?
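The textbook formula can be written out as a small function. This is just a sketch (the parameter names are mine); the important point, flagged in the docstring, is which miss rate the formula expects.

```python
def amat(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate,
         l2_miss_penalty):
    """Average memory access time, in cycles, for a two-level cache system.

    Note: l2_miss_rate here must be the LOCAL miss rate -- L2 misses
    divided by L2 accesses (i.e., by L1 misses), not by the total
    number of memory accesses.
    """
    l1_miss_penalty = l2_hit_time + l2_miss_rate * l2_miss_penalty
    return l1_hit_time + l1_miss_rate * l1_miss_penalty

# Made-up example numbers: 1-cycle L1 hit, 2% L1 miss rate,
# 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle L2 miss penalty.
print(amat(1, 0.02, 10, 0.50, 100))   # 1 + 0.02 * (10 + 0.5 * 100) = 2.2
```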
SLIDE 5
L2 cache design tradeoffs
An L1 cache must keep up with a processor core. That is a challenge to a circuit design team but keeps the problem simple: If a design is too slow, it fails. For an L2 cache, the tradeoffs are more complex:
◮ Increasing capacity improves L2 miss rate but makes L2 hit time, chip area, and (probably) energy use worse.
◮ Decreasing capacity improves L2 hit time, chip area, and (probably) energy use, but makes L2 miss rate worse.

Suppose L1 hit time = 1 cycle, L1 miss rate = 0.020, L2 miss penalty = 100 cycles. Which is better, considering AMAT only, not chip area or energy?
◮ (a) L2 hit time = 10 cycles, L2 miss rate = 0.50
◮ (b) L2 hit time = 12 cycles, L2 miss rate = 0.40
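One way to check the comparison numerically, with the textbook AMAT formula written as a function (a sketch; the function name is mine):

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    # Textbook two-level formula; l2_miss_rate is the local L2 miss rate.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_penalty)

a = amat(1, 0.020, 10, 0.50, 100)   # option (a): 1 + 0.02 * 60 = 2.2 cycles
b = amat(1, 0.020, 12, 0.40, 100)   # option (b): 1 + 0.02 * 52 = 2.04 cycles
print(a, b)
```

Under these numbers, option (b) gives the lower AMAT even though its L2 hit time is worse, because the better L2 miss rate avoids more 100-cycle penalties.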
SLIDE 6
Valid bits in caches going 1 → 0
I hope it’s obvious why V bits for blocks go 0 → 1. But why might V bits go 1 → 0? In other words, why does it sometimes make sense to invalidate one or more cache blocks? Here are two big reasons. (There are likely some other good reasons.)
◮ DMA: direct memory access.
◮ Instruction writes by O/S kernels and by programs that write their own instructions.

Let’s make some notes about each of these reasons.
SLIDE 7
3 C’s of cache misses: compulsory, capacity, conflict
It’s useful to think about the causes of cache misses.

Compulsory misses (sometimes called “cold misses”) happen on access to instructions or data that have never been in a cache. Examples include:

◮ instruction fetches in a program that has just been copied from disk to main memory;
◮ data reads of information that has just been copied from a disk controller or network interface to main memory.

Compulsory misses would happen even if a cache had the same capacity as the main memory the cache was supposed to mirror.
SLIDE 8
Capacity misses
This kind of miss arises because a cache is not big enough to contain all the instructions and/or data a program accesses while it runs.

Capacity misses for a program can be counted by simulating a program run with a fully-associative cache of some fixed capacity. Since instruction and data blocks can be placed anywhere within a fully-associative cache, it’s reasonable to assume that any miss on access to a previously accessed instruction or data item in a fully-associative cache occurs because the cache is not big enough.

Why is this a good but not perfect approximation?
SLIDE 9
Conflict misses
Conflict misses (also called “collision misses”) occur in direct-mapped and N-way set-associative caches because too many accesses to memory generate a common index.

In the absence of a 4th kind of miss—coherence misses, which can happen when multiple processors share access to a memory system—we can write:

conflict misses = total misses − compulsory misses − capacity misses

The main idea behind increasing set-associativity is to reduce conflict misses without the time and energy problems of a fully-associative cache.
SLIDE 10
3 C’s: Data from experiments
Textbook Figure B.8 has a lot of data; it’s unreasonable to try to jam all of that data into a few lecture slides. So here’s a subset of the data, for 8 KB capacity. N is the degree of associativity, and miss rates are in misses per thousand accesses.

  N   compulsory   capacity   conflict
  1      0.1          44         24
  2      0.1          44          5
  4      0.1          44      < 0.5
  8      0.1          44      < 0.5

This is real data from practical applications. It is worthwhile to study the table to see what general patterns emerge.
SLIDE 11
Caches and Virtual Memory
Both are essential systems to support applications running on modern operating systems. As mentioned two weeks ago, it really helps to keep in mind what problems are solved by caches and what very different problems are solved by virtual memory.

Caches are an impressive engineering workaround for difficult facts about relative latencies of memory arrays.

Virtual memory (VM) is a concept, a great design idea, that solves a wide range of problems for computer systems in which multiple applications are sharing resources.
SLIDE 12
VM preliminaries: The O/S kernel
When a computer with an operating system is powered up or reset, instructions in ROM begin the job of copying a special program called the kernel from the file system into memory. The kernel is a vital piece of software—once it is running, it controls hardware—memory, file systems, network interfaces, etc.—and schedules the access of other running programs to processor cores.
SLIDE 13
VM preliminaries: Processes
A process can be defined as an instance of a program in execution. Because the kernel has special behaviour and special powers not available to other running programs, the kernel is usually not considered to be a process. So when a computer is in normal operation there are many programs running concurrently: one kernel and many processes.
SLIDE 14
VM preliminaries: Examples of processes
Suppose you are typing a command into a terminal window on a Linux system. Two processes are directly involved: the terminal and a shell—the shell is the program that interprets your commands and launches other programs in response to your commands.

Suppose you enter the command gcc foo.c bar.c. A flurry of processes will come and go—one for the driver program gcc, two invocations of the compiler cc1, two invocations of the assembler as, and one invocation of the linker.

Then if you enter the command ./a.out, a process will be created from the executable you just built.
SLIDE 15
#1 problem solved by VM: protection
Processes need to be able to access memory quickly but safely. It would be disastrous if a process could accidentally or maliciously access memory in use for kernel instructions or kernel data. It would also be disastrous if processes could accidentally or maliciously access each other’s memory. (In the past, perhaps the #1 problem solved by VM was allowing the combined main memory use of all processes to exceed DRAM capacity. That’s still important today, but less important than it used to be, because current DRAM circuits are cheap and have huge capacities.)
SLIDE 16
How VM provides memory protection
The kernel gives a virtual address space to each process. Suppose P is a process. P can use its own virtual address space with

◮ no risk that P will access other processes’ memory;
◮ no risk that other processes will access P’s memory.
(That is a slight oversimplification—modern OSes allow intentional sharing of memory by cooperating processes.) Processes never know the physical DRAM addresses of the memory they use. The addresses used by processes are virtual addresses, which get translated into physical addresses for access to memory circuits. Translations are managed by the kernel.
SLIDE 17
Pages
The basic unit of virtual memory is called a page. The size of a page must be a power of two. Different systems have different page sizes, and in some instances a single system will support two or more different page sizes at the same time. A very common page size is 4 KB.

How many 4 KB pages are available in a system with 8 GB of memory?

An address in a VM system is split into

◮ a page number, which indicates which page the address belongs to;
◮ and a page offset, which gives the location within a page of a byte, word, or similar small chunk of data.
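As a sketch of the arithmetic (variable and function names are mine), here is the page count for the question above, plus the page-number/offset split for an arbitrary example address:

```python
PAGE_SIZE = 4 * 1024              # 4 KB = 2**12 bytes
MEM_SIZE  = 8 * 1024**3           # 8 GB = 2**33 bytes

num_pages = MEM_SIZE // PAGE_SIZE
print(num_pages)                  # 2**21 = 2,097,152 pages

OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 offset bits for a 4 KB page

def split(addr):
    """Split an address into (page number, page offset)."""
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)

page_num, offset = split(0x12345678)       # arbitrary example address
print(hex(page_num), hex(offset))          # 0x12345, 0x678
```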
SLIDE 18
Example address splits for VM
Let’s show the address split for an address width of 40 bits and a page size of 4 KB.

Let’s show the address split for an address width of 48 bits and a page size of 4 KB.

Let’s show the address split for an address width of 40 bits and a “huge page” size of 2 MB.
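The three splits can be computed mechanically. A small sketch (the function name is mine): the page offset needs log2(page size) bits, and the page number gets whatever address bits remain.

```python
def address_split(addr_bits, page_size):
    """Return (page-number bits, page-offset bits) for a power-of-two page size."""
    offset_bits = (page_size - 1).bit_length()   # log2 of the page size
    return addr_bits - offset_bits, offset_bits

KB, MB = 1024, 1024**2
print(address_split(40, 4 * KB))   # (28, 12): 28-bit page number, 12-bit offset
print(address_split(48, 4 * KB))   # (36, 12)
print(address_split(40, 2 * MB))   # (19, 21): huge pages need a 21-bit offset
```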
SLIDE 19
Virtual addresses and physical addresses
Processes work with virtual addresses. So the PC (program counter) register, used for instruction fetches, all the pointer variables a process uses for data accesses, and all the array element addresses a process generates are virtual addresses. To actually fetch an instruction or read or write data memory, a virtual address must be translated into a physical address. (For relative simplicity, for now let’s assume that caches work entirely with physical addresses, so address splits into tags, indexes, and block offsets are based on physical addresses. We’ll revisit this assumption later.)
SLIDE 20
Translation is simple
[Diagram: a virtual address is split into a virtual page number and a page offset. The virtual page number is translated into a physical page number; the page offset is copied straight across, with no translation. The physical page number and the copied offset together form the physical address.]
Obviously, the page offsets have to have the same width in both addresses. Do the VPN and PPN have to have the same width?
SLIDE 21
Page tables and TLBs
The kernel controls all of the sets of translations for all of the processes running on a system. The kernel maintains many page tables—one page table for each process. A page table is a big data structure in main memory, a master list of all of the VPN-to-PPN translations in use for a particular process.

If every instruction fetch and every data memory read or write done by a process required a search in a page table, instruction throughput would be absurdly low. Specialized circuits called TLBs (translation lookaside buffers) are set up so that most VPN-to-PPN translations can be done very fast—in one or two processor clock cycles.
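The lookup order (TLB first, page table only on a TLB miss) can be sketched with ordinary dictionaries standing in for hardware. This is a toy model, not how a real MMU is built, and all the mapping numbers are made up for illustration:

```python
PAGE_SIZE = 4096
OFFSET_BITS = 12

# VPN -> PPN mappings for one process; values are made up.
page_table = {0x00400: 0x1A2B3, 0x00401: 0x0F00D}
tlb = {}   # small, fast cache of recently used translations

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn in tlb:                # TLB hit: one or two cycles in hardware
        ppn = tlb[vpn]
    else:                         # TLB miss: walk the page table in memory
        ppn = page_table[vpn]     # a KeyError here would model a page fault
        tlb[vpn] = ppn            # cache the translation for next time
    return (ppn << OFFSET_BITS) | offset

paddr = translate(0x00400ABC)     # VPN 0x00400, offset 0xABC
print(hex(paddr))                 # 0x1a2b3abc
```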
SLIDE 22
Memory hierarchy in the Pentium III / Pentium 4 era
[Diagram: the core has an L1 I-cache with an I-TLB and an L1 D-cache with a D-TLB. Both L1 caches are backed by a unified L2 cache, which connects through a DRAM controller to the DRAM modules.]
Typical address widths in that era: 32-bit virtual addresses, 36-bit physical addresses.
SLIDE 23
Linux / Mac OS X virtual address spaces on x86-64
Pointers are 64 bits wide, but only the least significant 48 bits are used in a virtual address.
Byte addresses, from highest to lowest:

0xffff ffff ffff ffff
0xffff ffff ffff fffe
  . . .                  virtual address space for the O/S kernel
0xffff 8000 0000 0000

                         HUGE range of invalid addresses

0x0000 7fff ffff ffff
0x0000 7fff ffff fffe
  . . .                  virtual address space for user processes
0x0000 0000 0000 0000

(For 64-bit Microsoft Windows, the picture is either identical, or not quite the same but very similar.)
SLIDE 24
A page table for an x86-64 Linux process
The normal page size is 4 KB. Let’s sketch a page table for a single process.
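Before sketching, a back-of-the-envelope calculation shows why a single flat table is out of the question for 48-bit virtual addresses (the 8-byte entry size is an assumption for illustration):

```python
VA_BITS = 48
OFFSET_BITS = 12      # 4 KB pages
ENTRY_BYTES = 8       # assumed page-table-entry size, for illustration

num_vpns = 2 ** (VA_BITS - OFFSET_BITS)    # 2**36 possible virtual pages
flat_table_bytes = num_vpns * ENTRY_BYTES
print(flat_table_bytes // 2**30)           # 512 (GB) -- per process!
```

This is why real x86-64 page tables are multi-level trees: the 36 VPN bits are split into four 9-bit indexes, one per level, and only the parts of the tree a process actually uses need to be allocated.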
SLIDE 25
Topics after Reading Week
Short-term:
◮ More about virtual memory.
◮ Interaction of caches and VM hardware.
Related reading in Hennessy & Patterson: Sections B.4–B.5.

Big topics for the second half of the course:
◮ Instruction-level parallelism.
◮ Thread-level parallelism.
Related reading in Hennessy & Patterson: Appendix C, Chapters 3 and 5.