Slides for Lecture 7, ENCM 501: Principles of Computer Architecture

SLIDE 1

Slides for Lecture 7

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

30 January, 2014

SLIDE 2

Previous Lecture

◮ endianness
◮ addressing modes
◮ examples of tradeoffs in instruction set design

SLIDE 3

Today’s Lecture

◮ completion of previous lecture
◮ introduction to memory systems
◮ review of SRAM and DRAM

Related reading in Hennessy & Patterson: Sections 2.1, B.1

SLIDE 4

Conditional branch options

Most ISAs make branch decisions based on a few bits called flag bits or condition code bits that sit within some kind of processor status register. Let’s look at this for a simple C example, in which i and k are int variables in registers:

    if (i < k) goto L1;

x86-64 translation, assuming i in %eax, k in %edx:

    cmpl %edx, %eax   # compare registers
    jl   L1           # branch based on N and V flags

jl means “jump if less than.” (Note: In reality the assembly language label almost certainly won’t be the same as the C label L1.)

SLIDE 5

For the same C code, here is an ARM translation, assuming i in r0, k in r1:

    CMP r0, r1   ; compare registers
    BLT L1       ; branch based on N and V flags

MIPS is unusual: the comparison result goes into a GPR. Suppose we have i in R4, k in R5:

    SLT R8, R4, R5   # R8 = (R4 < R5)
    BNE R8, R0, L1   # branch if R8 != 0

SLIDE 6

Conditional instructions in ARM

Recall from Assignment 1 that MIPS offers the conditional move instructions MOVN and MOVZ. (MIPS also has some similar floating-point conditional move instructions). ARM takes this idea to the extreme—every ARM instruction is conditional! Bits 31–28 of an ARM instruction are the so-called cond field, which specifies that the instruction either performs some action or is a no-op, depending on some condition on zero or more of the N, Z, V and C flags. Example ARM cond field patterns:

◮ 1110, for ALWAYS. The instruction is never a no-op. This is the default cond field in ARM assembly language.

◮ 0000, for EQUAL. Execute the instruction if and only if the Z flag is 1.

SLIDE 7

The power of ARM conditional instructions is illustrated by this example . . . Here is some C code:

    if (i == 33 || i == 63)
        count++;

If i and count are ints in ARM registers r0 and r1, here is ARM assembly language for the C code:

    TEQ   r0, #33    ; # indicates immediate mode
    TEQNE r0, #63
    ADDEQ r1, r1, #1 ; Note typo in Lec 6 slides!

The cond field for the first instruction is 1110, for “always”. For the second instruction, it’s 0001, for “do it only if the Z flag is 0”, and for the third, it’s 0000, for “do it only if the Z flag is 1”.

SLIDE 8

Acknowledgment: Example on previous slide adapted from an example on pages 129–130 of Hohl, W., ARM Assembly Language: Fundamentals and Techniques, © 2009 ARM (UK), published by CRC Press.

SLIDE 9

MIPS versus ARM: Vague arguments

CPU time = IC × CPI × clock period

MIPS attacks CPI by making instructions very simple and easy to pipeline. ARM tries to be close to MIPS with respect to CPI, and is much better than older CISC ISAs for CPI. ARM attacks IC by doing things in one instruction that might sometimes take two or three MIPS instructions.
SLIDE 10

MIPS versus ARM: How to be quantitative

A fair and thorough study would require at least:

◮ real applications that are reasonably good fits for both ISAs;
◮ the best possible compilers for each of the ISAs;
◮ processors fabricated with the same transistor and interconnect technology, and very similar die sizes.

Even then, it might not be a truly fair fight between ISAs, if one side has better digital designers than the other.
SLIDE 11

We’re moving on from ISA to microarchitecture

For (much) more about ISA design considerations, see Appendix K of the textbook, which is available in PDF format as a no-charge download. The first aspect of microarchitecture we’ll look at is the memory hierarchy.

SLIDE 12

Views of memory: ISA versus microarchitecture (1)

The modern ISA view of memory is simple: Memory is flat. For a program on a 32-bit system, a few regions within the address space from 0 to 0xffffffff are available. As long as alignment rules are respected, any memory read is pretty much the same as any other read, and any memory write is pretty much the same as any other write.

The story is essentially the same for 64-bit systems, except that the maximum address is 0xffffffffffffffff.

This simplicity is great for compiler writers choosing addressing modes for instructions, and for linker writers finding ways to stitch pieces of machine language together into complete machine language programs.

SLIDE 13

Views of memory: ISA versus microarchitecture (2)

The modern microarchitecture view of memory is that memory is not at all simple! Modern memory systems are designed as complex hierarchies, with some subsystems optimized for high speed and others for large capacity and/or low cost. Energy use per memory access may be an important factor as well. Understanding of this kind of hierarchy is critical at several levels of computer engineering. Examples:

◮ selection of processors for embedded applications
◮ systems software development: operating system kernels, libraries, etc.
◮ application software development

SLIDE 14

Components within a memory system

The schematic on the next page shows typical memory organization for a desktop computer in the time period from about 1999 to 2004.

The box labeled CORE would contain GPRs, ALUs, control circuits and so on.

TLB stands for translation lookaside buffer. A TLB does high-speed translation of virtual addresses into physical addresses. The core generates virtual addresses: PC values for instruction fetches, and data addresses generated by load and store instructions. Most cache designs are based on physical addresses, and the DRAM circuits definitely require physical addresses.

SLIDE 15

Sizes of boxes reflect neither chip area nor storage capacity!

[Schematic: a processor core with split L1 I-CACHE (with I-TLB) and L1 D-CACHE (with D-TLB), a UNIFIED L2 CACHE, a DRAM CONTROLLER, and DRAM MODULES.]

The yellow box shows what would be included in a processor chip in the 1999–2004 time frame. In 2014, a quad-core chip would include four copies of everything in yellow, plus a large L3 cache shared by all four cores. The DRAM controller would be on-chip.
SLIDE 16

What are caches for?

In trying to make sense of the complicated interconnections and interactions between caches it really helps to keep in mind what problems are solved by caches and what very different problems are solved by virtual memory. Let’s start with caches. Caches exist to optimize performance in the face of some difficult facts:

◮ DRAM latency is on the order of 100 processor clock cycles
◮ latency in small SRAM arrays is on the order of 1 processor clock cycle
◮ latency in larger SRAM arrays is on the order of 10 processor clock cycles

SLIDE 17

What is virtual memory for?

Virtual memory is a system that operating system kernels can use to support applications. Some of the key benefits are:

◮ Protection. Each process (each running user program) has its own virtual address space. Processes cannot accidentally or maliciously access each other’s memory.

◮ Efficient memory allocation. A kernel can give an application a large contiguous piece of virtual address space made from many fragmented pieces of physical address space.

◮ Spilling to disk. If DRAM gets close to full, the kernel can copy pages of application memory to disk; the effective memory available can be greater than the DRAM capacity.

SLIDE 18

SRAM and DRAM

Before looking in detail at how caches work, let’s look at the two main kinds of volatile storage in use in computer systems.

SLIDE 19

The “6T” SRAM (Static RAM) cell

[Schematic: 6T SRAM cell. Cross-coupled inverters hold complementary values at nodes Q and QN; two access transistors, gated by WORDLINE, connect Q and QN to the complementary bitlines BITLINE and BITLINE-bar.]

Q near VDD is a stored 1, and Q near ground is a stored 0. It’s called static RAM because in normal operation, with WORDLINE low, the voltages at nodes Q and QN are stable. The bistable pair of inverters corrects for the effects of noise and leakage currents.

SLIDE 20

Writing a 1 to an SRAM cell

[Schematic: the same 6T SRAM cell, with nodes Q and QN, bitlines BITLINE and BITLINE-bar, and WORDLINE.]

Set BITLINE to VDD and BITLINE-bar to 0. Turn on WORDLINE. If Q was previously 0, the signals on the bitlines overpower the inverter pair, making QN 0 and Q 1. If Q was already 1, nothing much happens in the cell. (To write a 0, set BITLINE to 0 and BITLINE-bar to VDD.)

SLIDE 21

Reading from an SRAM cell

[Schematic: the same 6T SRAM cell, with nodes Q and QN, bitlines BITLINE and BITLINE-bar, and WORDLINE.]

Pre-charge both BITLINE and BITLINE-bar to equal voltages, somewhere near 0.5 VDD. Turn on WORDLINE, just long enough for the cell to create a voltage difference between BITLINE and BITLINE-bar, such that the difference can be reliably measured by a sense amplifier.

SLIDE 22

A 4 × 4 SRAM array

A circuit schematic is shown on the next slide. The address inputs A1 and A0 allow four different addresses: 00, 01, 10, 11. The data lines D3, D2, D1, D0 are bidirectional to support both reads and writes. The group of signals labeled CTRL would include some kind of “select” signal to activate the circuit, a READ/WRITE signal, and possibly a clock signal.

The term wordline is potentially misleading. A wordline activates an entire row of cells. The number of bits in a row may be smaller or much, much larger than the width of a processor word. How many cells would a single wordline activate in a 1 Mb SRAM array?

SLIDE 23

[Schematic: 4 × 4 SRAM array. An address decoder with inputs A1, A0 drives wordlines WL3–WL0; each column has a complementary bitline pair (BL3 and BL3-bar down to BL0 and BL0-bar); bitline driver and sense amp circuits, with control inputs CTRL, connect the columns to data lines D3–D0.]

SLIDE 24

Scaling up the SRAM array

SRAM arrays suitable for use in caches in processor chips have tens of thousands to tens of millions of SRAM cells. A typical bitline would be a wire at most a few millimeters long, quite small by human standards. It’s crucial to realize that such a wire is gigantic compared to a tiny SRAM cell! This problem gets worse as the SRAM array capacity grows.

If the latency of an SRAM array is proportional to the length of its bitlines, how does latency grow with the capacity of the array?

SLIDE 25

The “1T” DRAM (Dynamic RAM) cell

[Schematic: 1T DRAM cell. An access transistor, gated by WORDLINE, connects BITLINE to a storage capacitor at node Q.]

The bit is stored as a voltage on a capacitor. A relatively high voltage at Q is a 1, and a relatively low voltage at Q is a 0. When the stored bit is a 1, charge is slowly leaking from node Q to ground. In a DRAM array, each row of cells must periodically be read and written back to strengthen the voltages in cells with stored 1’s—this is called refresh. DRAM gets the name dynamic from the continuing activity needed to keep the stored data valid.

SLIDE 26

Writing to a DRAM cell

[Schematic: the same 1T DRAM cell, with BITLINE, WORDLINE and storage node Q.]

Set BITLINE to the appropriate voltage for a 1 or a 0. Turn on WORDLINE. Q will take on the appropriate voltage.

SLIDE 27

Reading from a DRAM cell

[Schematic: the same 1T DRAM cell, with BITLINE, WORDLINE and storage node Q.]

Pre-charge BITLINE and some nearby electrically similar reference wire to the same voltage, somewhere between logic 0 and logic 1. Turn on WORDLINE. The cell will create a voltage difference between BITLINE and the reference wire, such that the difference can be reliably measured by a sense amplifier.

SLIDE 28

A 4 × 4 DRAM array

A circuit schematic is shown on the next slide. There is no good commercial reason to build such a tiny DRAM array, but nevertheless the schematic can be used to partially explain how DRAM works.

In a read operation, half of the bitlines get used to capture bit values from DRAM cells, and the other half are used as reference wires. This technique is called folded bitlines. The schematic does not show the physical layout of folded bitlines.

The block labeled [THIS IS COMPLICATED!] has a lot to do! In there we need bitline drivers, sense amplifiers, refresh logic, and more . . .

SLIDE 29

[Schematic: 4 × 4 DRAM array. An address decoder with inputs A1, A0 drives wordlines WL3–WL0; bitlines (BL3 and BL3-bar down to BL0 and BL0-bar) connect the cells to a block labeled [THIS IS COMPLICATED!], with control inputs CTRL and data lines D3–D0.]

SLIDE 30

DRAM arrays have long latencies compared to SRAM arrays. Why?

1. DRAM arrays typically have much larger capacities than SRAM arrays, so the ratio of cell dimensions to bitline length is much worse for DRAM arrays.

2. A passive capacitor (DRAM) is less effective at changing bitline voltages than is an active pair of inverters (SRAM).

3. Today, SRAM circuits are usually on the same chip as processor cores, while DRAMs are off-chip, connected to processor chips by wires that may be as long as tens of millimeters.

4. DRAM circuits have to dedicate some time to refresh, but SRAM circuits don’t.

SLIDE 31

Upcoming Topics

◮ Caches

Related reading in Hennessy & Patterson: Sections B.2–B.3