SLIDE 1 Slides for Lecture 11
ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng
Electrical & Computer Engineering Schulich School of Engineering University of Calgary
13 February, 2014
SLIDE 2 ENCM 501 W14 Slides for Lecture 11
slide 2/25
Previous Lecture
12:30 to 1:10pm: Quiz #1
1:15 to 1:45pm . . .
◮ fully-associative caches
◮ cache options for handling writes
◮ write buffers
◮ multi-level caches
SLIDE 3
Today’s Lecture
◮ more about multi-level caches
◮ classifying cache misses: the 3 C’s
◮ introduction to virtual memory
Related reading in Hennessy & Patterson: Sections B.2–B.4
SLIDE 4
AMAT in a two-level cache system
Textbook formula:

AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)

It was pointed out (I think) in the previous lecture that the L1 hit time should be weighted by the L1 hit rate. What reasonable assumption would imply that such a weighting would be INCORRECT?

This definition (not from the textbook!) is incorrect: For a system with two levels of caches, the L2 hit rate of a program is the number of L2 hits divided by the total number of memory accesses.

What is a correct definition for L2 hit rate, compatible with the formula for AMAT?
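The textbook formula can be written out as a small function. This is just a sketch (the parameter names are mine); the important point, flagged in the docstring, is which miss rate the formula expects.

```python
def amat(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate,
         l2_miss_penalty):
    """Average memory access time, in cycles, for a two-level cache system.

    Note: l2_miss_rate here must be the LOCAL miss rate -- L2 misses
    divided by L2 accesses (i.e., by L1 misses), not by the total
    number of memory accesses.
    """
    l1_miss_penalty = l2_hit_time + l2_miss_rate * l2_miss_penalty
    return l1_hit_time + l1_miss_rate * l1_miss_penalty

# Made-up example numbers: 1-cycle L1 hit, 2% L1 miss rate,
# 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle L2 miss penalty.
print(amat(1, 0.02, 10, 0.50, 100))   # 1 + 0.02 * (10 + 0.5 * 100) = 2.2
```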
SLIDE 5
L2 cache design tradeoffs
An L1 cache must keep up with a processor core. That is a challenge to a circuit design team but keeps the problem simple: If a design is too slow, it fails. For an L2 cache, the tradeoffs are more complex:
◮ Increasing capacity improves L2 miss rate but makes L2 hit time, chip area, and (probably) energy use worse.
◮ Decreasing capacity improves L2 hit time, chip area, and (probably) energy use, but makes L2 miss rate worse.

Suppose L1 hit time = 1 cycle, L1 miss rate = 0.020, L2 miss penalty = 100 cycles. Which is better, considering AMAT only, not chip area or energy?
◮ (a) L2 hit time = 10 cycles, L2 miss rate = 0.50
◮ (b) L2 hit time = 12 cycles, L2 miss rate = 0.40
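One way to check the comparison numerically, with the textbook AMAT formula written as a function (a sketch; the function name is mine):

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    # Textbook two-level formula; l2_miss_rate is the local L2 miss rate.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_penalty)

a = amat(1, 0.020, 10, 0.50, 100)   # option (a): 1 + 0.02 * 60 = 2.2 cycles
b = amat(1, 0.020, 12, 0.40, 100)   # option (b): 1 + 0.02 * 52 = 2.04 cycles
print(a, b)
```

Under these numbers, option (b) gives the lower AMAT even though its L2 hit time is worse, because the better L2 miss rate avoids more 100-cycle penalties.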
SLIDE 6
Valid bits in caches going 1 → 0
I hope it’s obvious why V bits for blocks go 0 → 1. But why might V bits go 1 → 0? In other words, why does it sometimes make sense to invalidate one or more cache blocks? Here are two big reasons. (There are likely some other good reasons.)
◮ DMA: direct memory access.
◮ Instruction writes by O/S kernels and by programs that write their own instructions.

Let’s make some notes about each of these reasons.
SLIDE 7
3 C’s of cache misses: compulsory, capacity, conflict
It’s useful to think about the causes of cache misses.

Compulsory misses (sometimes called “cold misses”) happen on access to instructions or data that have never been in a cache. Examples include:

◮ instruction fetches in a program that has just been copied from disk to main memory;
◮ data reads of information that has just been copied from a disk controller or network interface to main memory.

Compulsory misses would happen even if a cache had the same capacity as the main memory the cache was supposed to mirror.
SLIDE 8
Capacity misses
This kind of miss arises because a cache is not big enough to contain all the instructions and/or data a program accesses while it runs.

Capacity misses for a program can be counted by simulating a program run with a fully-associative cache of some fixed capacity. Since instruction and data blocks can be placed anywhere within a fully-associative cache, it’s reasonable to assume that any miss on access to a previously accessed instruction or data item in a fully-associative cache occurs because the cache is not big enough.

Why is this a good but not perfect approximation?
SLIDE 9
Conflict misses
Conflict misses (also called “collision misses”) occur in direct-mapped and N-way set-associative caches because too many accesses to memory generate a common index.

In the absence of a 4th kind of miss—coherence misses, which can happen when multiple processors share access to a memory system—we can write:

conflict misses = total misses − compulsory misses − capacity misses

The main idea behind increasing set-associativity is to reduce conflict misses without the time and energy problems of a fully-associative cache.
SLIDE 10
3 C’s: Data from experiments
Textbook Figure B.8 has a lot of data; it’s unreasonable to try to jam all of that data into a few lecture slides. So here’s a subset of the data, for 8 KB capacity. N is the degree of associativity, and miss rates are in misses per thousand accesses.

  N   compulsory   capacity   conflict
  1      0.1          44         24
  2      0.1          44          5
  4      0.1          44      < 0.5
  8      0.1          44      < 0.5

This is real data from practical applications. It is worthwhile to study the table to see what general patterns emerge.
SLIDE 11
Caches and Virtual Memory
Both are essential systems to support applications running on modern operating systems. As mentioned two weeks ago, it really helps to keep in mind what problems are solved by caches and what very different problems are solved by virtual memory.

Caches are an impressive engineering workaround for difficult facts about relative latencies of memory arrays.

Virtual memory (VM) is a concept, a great design idea, that solves a wide range of problems for computer systems in which multiple applications are sharing resources.
SLIDE 12
VM preliminaries: The O/S kernel
When a computer with an operating system is powered up or reset, instructions in ROM begin the job of copying a special program called the kernel from the file system into memory. The kernel is a vital piece of software—once it is running, it controls hardware—memory, file systems, network interfaces, etc.—and schedules the access of other running programs to processor cores.
SLIDE 13
VM preliminaries: Processes
A process can be defined as an instance of a program in execution. Because the kernel has special behaviour and special powers not available to other running programs, the kernel is usually not considered to be a process. So when a computer is in normal operation there are many programs running concurrently: one kernel and many processes.
SLIDE 14
VM preliminaries: Examples of processes
Suppose you are typing a command into a terminal window on a Linux system. Two processes are directly involved: the terminal and a shell—the shell is the program that interprets your commands and launches other programs in response to your commands.

Suppose you enter the command gcc foo.c bar.c. A flurry of processes will come and go—one for the driver program gcc, two invocations of the compiler cc1, two invocations of the assembler as, and one invocation of the linker.

Then if you enter the command ./a.out, a process will be created from the executable you just built.
SLIDE 15
#1 problem solved by VM: protection
Processes need to be able to access memory quickly but safely. It would be disastrous if a process could accidentally or maliciously access memory in use for kernel instructions or kernel data. It would also be disastrous if processes could accidentally or maliciously access each other’s memory. (In the past, perhaps the #1 problem solved by VM was allowing the combined main memory use of all processes to exceed DRAM capacity. That’s still important today, but less important than it used to be, because current DRAM circuits are cheap and have huge capacities.)
SLIDE 16
How VM provides memory protection
The kernel gives a virtual address space to each process. Suppose P is a process. P can use its own virtual address space with

◮ no risk that P will access other processes’ memory;
◮ no risk that other processes will access P’s memory.
(That is a slight oversimplification—modern OSes allow intentional sharing of memory by cooperating processes.) Processes never know the physical DRAM addresses of the memory they use. The addresses used by processes are virtual addresses, which get translated into physical addresses for access to memory circuits. Translations are managed by the kernel.
SLIDE 17
Pages
The basic unit of virtual memory is called a page. The size of a page must be a power of two. Different systems have different page sizes, and in some instances a single system will support two or more different page sizes at the same time. A very common page size is 4 KB.

How many 4 KB pages are available in a system with 8 GB of memory?

An address in a VM system is split into

◮ a page number, which indicates which page the address belongs to;
◮ and a page offset, which gives the location within a page of a byte, word, or similar small chunk of data.
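As a sketch of the arithmetic (variable and function names are mine), here is the page count for the question above, plus the page-number/offset split for an arbitrary example address:

```python
PAGE_SIZE = 4 * 1024              # 4 KB = 2**12 bytes
MEM_SIZE  = 8 * 1024**3           # 8 GB = 2**33 bytes

num_pages = MEM_SIZE // PAGE_SIZE
print(num_pages)                  # 2**21 = 2,097,152 pages

OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 offset bits for a 4 KB page

def split(addr):
    """Split an address into (page number, page offset)."""
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)

page_num, offset = split(0x12345678)       # arbitrary example address
print(hex(page_num), hex(offset))          # 0x12345, 0x678
```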
SLIDE 18
Example address splits for VM
Let’s show the address split for an address width of 40 bits and a page size of 4 KB.

Let’s show the address split for an address width of 48 bits and a page size of 4 KB.

Let’s show the address split for an address width of 40 bits and a “huge page” size of 2 MB.
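The three splits can be computed mechanically. A small sketch (the function name is mine): the page offset needs log2(page size) bits, and the page number gets whatever address bits remain.

```python
def address_split(addr_bits, page_size):
    """Return (page-number bits, page-offset bits) for a power-of-two page size."""
    offset_bits = (page_size - 1).bit_length()   # log2 of the page size
    return addr_bits - offset_bits, offset_bits

KB, MB = 1024, 1024**2
print(address_split(40, 4 * KB))   # (28, 12): 28-bit page number, 12-bit offset
print(address_split(48, 4 * KB))   # (36, 12)
print(address_split(40, 2 * MB))   # (19, 21): huge pages need a 21-bit offset
```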
SLIDE 19
Virtual addresses and physical addresses
Processes work with virtual addresses. So the PC (program counter) register, used for instruction fetches, all the pointer variables a process uses for data accesses, and all the array element addresses a process generates are virtual addresses. To actually fetch an instruction or read or write data memory, a virtual address must be translated into a physical address. (For relative simplicity, for now let’s assume that caches work entirely with physical addresses, so address splits into tags, indexes, and block offsets are based on physical addresses. We’ll revisit this assumption later.)
SLIDE 20
Translation is simple
[Diagram: a virtual address is split into a virtual page number and a page offset. The virtual page number is translated into a physical page number; the page offset is copied straight across, with no translation. The physical page number and the copied offset together form the physical address.]
Obviously, the page offsets have to have the same width in both addresses. Do the VPN and PPN have to have the same width?
SLIDE 21
Page tables and TLBs
The kernel controls all of the sets of translations for all of the processes running on a system. The kernel maintains many page tables—one page table for each process. A page table is a big data structure in main memory, a master list of all of the VPN-to-PPN translations in use for a particular process.

If every instruction fetch and every data memory read or write done by a process required a search in a page table, instruction throughput would be absurdly low. Specialized circuits called TLBs (translation lookaside buffers) are set up so that most VPN-to-PPN translations can be done very fast—in one or two processor clock cycles.
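The lookup order (TLB first, page table only on a TLB miss) can be sketched with ordinary dictionaries standing in for hardware. This is a toy model, not how a real MMU is built, and all the mapping numbers are made up for illustration:

```python
PAGE_SIZE = 4096
OFFSET_BITS = 12

# VPN -> PPN mappings for one process; values are made up.
page_table = {0x00400: 0x1A2B3, 0x00401: 0x0F00D}
tlb = {}   # small, fast cache of recently used translations

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn in tlb:                # TLB hit: one or two cycles in hardware
        ppn = tlb[vpn]
    else:                         # TLB miss: walk the page table in memory
        ppn = page_table[vpn]     # a KeyError here would model a page fault
        tlb[vpn] = ppn            # cache the translation for next time
    return (ppn << OFFSET_BITS) | offset

paddr = translate(0x00400ABC)     # VPN 0x00400, offset 0xABC
print(hex(paddr))                 # 0x1a2b3abc
```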
SLIDE 22
Memory hierarchy in the Pentium III / Pentium 4 era
[Diagram: the core has an L1 I-cache with an I-TLB and an L1 D-cache with a D-TLB. Both L1 caches are backed by a unified L2 cache, which connects through a DRAM controller to the DRAM modules.]
Typical address widths in that era: 32-bit virtual addresses, 36-bit physical addresses.
SLIDE 23
Linux / Mac OS X virtual address spaces on x86-64
Pointers are 64 bits wide, but only the least significant 48 bits are used in a virtual address.
Byte addresses, from highest to lowest:

0xffff ffff ffff ffff
0xffff ffff ffff fffe
  . . .                  virtual address space for the O/S kernel
0xffff 8000 0000 0000

                         HUGE range of invalid addresses

0x0000 7fff ffff ffff
0x0000 7fff ffff fffe
  . . .                  virtual address space for user processes
0x0000 0000 0000 0000

(For 64-bit Microsoft Windows, the picture is either identical, or not quite the same but very similar.)
SLIDE 24
A page table for an x86-64 Linux process
The normal page size is 4 KB. Let’s sketch a page table for a single process.
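Before sketching, a back-of-the-envelope calculation shows why a single flat table is out of the question for 48-bit virtual addresses (the 8-byte entry size is an assumption for illustration):

```python
VA_BITS = 48
OFFSET_BITS = 12      # 4 KB pages
ENTRY_BYTES = 8       # assumed page-table-entry size, for illustration

num_vpns = 2 ** (VA_BITS - OFFSET_BITS)    # 2**36 possible virtual pages
flat_table_bytes = num_vpns * ENTRY_BYTES
print(flat_table_bytes // 2**30)           # 512 (GB) -- per process!
```

This is why real x86-64 page tables are multi-level trees: the 36 VPN bits are split into four 9-bit indexes, one per level, and only the parts of the tree a process actually uses need to be allocated.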
SLIDE 25
Topics after Reading Week
Short-term:
◮ More about virtual memory.
◮ Interaction of caches and VM hardware.
Related reading in Hennessy & Patterson: Sections B.4–B.5.

Big topics for the second half of the course:
◮ Instruction-level parallelism.
◮ Thread-level parallelism.
Related reading in Hennessy & Patterson: Appendix C, Chapters 3 and 5.