Slides for Lecture 21 ENCM 501: Principles of Computer Architecture


SLIDE 1

Slides for Lecture 21

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

27 March, 2014

SLIDE 2

ENCM 501 W14 Slides for Lecture 21

slide 2/22

Previous Lecture

◮ more examples of Tomasulo’s algorithm
◮ reorder buffers and speculation
◮ introduction to multiple issue of instructions

Related reading in Hennessy & Patterson: Sections 3.5–3.8

SLIDE 3

Today’s Lecture

◮ overview of multiple issue of instructions
◮ quick note about limitations of ILP
◮ processes and threads
◮ introduction to SMP architecture
◮ introduction to memory coherency

Related reading in Hennessy & Patterson: Sections 3.7, 3.8, 3.10, 5.1, 5.2

SLIDE 4

Multiple issue

Multiple issue means issue of 2 or more instructions within a single processor core, in a single clock cycle. This is superscalar execution. Many quite different processor organization schemes support multiple issue. Examples:

◮ In-order processing of instructions in two parallel pipelines—e.g., ARM Cortex-A8 described in textbook Section 3.13. Obviously no more than two instructions issue per clock.

◮ In-order issue, out-of-order execution, in-order commit of up to six instructions per clock—current x86-64 microarchitectures, e.g., Intel Core i7, also described in textbook Section 3.13.

SLIDE 5

Multiple issue: Instruction unit requirements

The instruction unit needs to be able to fetch multiple instructions per clock. This has obvious implications for design of L1 I-caches!

The instruction unit has to look at pending instructions, and decide every clock cycle how many instructions are safe to issue in parallel. Let’s look at an easy example for MIPS32, in a microarchitecture that can issue a maximum of two instructions per clock:

LW   R8, 24(R9)
ADDU R10, R11, R4

If the above instructions are the next two in line in program order, can they be issued in parallel? Why or why not?
SLIDE 6

Here is another easy example. The next three instructions in line, in program order, are:

LW   R8, 24(R9)
ADDU R16, R16, R8
SLT  R9, R8, R4

How will these instructions be managed?

In general, the problem of deciding how many instructions to issue in parallel is difficult. See textbook Section 3.8 for further detail. (My favourite sentence in that section is, “Analyze all the dependencies among the instructions in the issue bundle.” That is only one of several things that have to happen every clock cycle!)

SLIDE 7

Limitations of ILP

A shortage of time in ENCM 501 requires brutal oversimplification, but . . .

Roughly speaking, computer architects have discovered that no matter how many transistors are thrown at the problem, it’s very hard to get more than an average of about 2.5 instructions completed per cycle, due mostly to dependencies between instructions. See textbook Figure 3.43 for real data—for 19 different SPECCPU2006 benchmark programs run on a Core i7, the best CPI is 0.44 (2.27 instructions per clock), and some programs have CPIs greater than 1.0!

SLIDE 8

So why try to support issue of 6 instructions/cycle?

A few slides back: “In-order issue, out-of-order execution, in-order commit of up to six instructions per clock—current x86-64 microarchitectures, e.g., Intel Core i7.” It might pay to run two or more threads at the same time in the same core! That might get the average instruction issue rate close to 6 per cycle. Intel’s term for this idea is hyper-threading. (As of March 26, 2014, the Wikipedia article titled Hyper-threading provides a good introduction.) A more generic term for the same idea is simultaneous multithreading (SMT)—see textbook Section 3.12 for discussion.

SLIDE 9

Processes and Threads

Suppose that you build an executable from a bunch of C source files. When you run the executable, it’s almost certain that the program will run in a single thread, unless your source files have explicitly asked in some way for multiple threads. (I use the word almost in case I am wrong about the state of the art for C compilers and program linkers.) All the C code used in ENCM 501 up to Assignment 7 is single-threaded. (And Assignment 8 has no C code at all.)

SLIDE 10

A one-thread process

(Timeline diagram: a single thread does all the work between start and finish.)

Program results are generated as if instructions are processed in program order, as if each instruction finishes as the next instruction starts. If a processor is able to find significant ILP in the program, in the best case, CPI significantly less than 1 can be achieved. In practice, due to (a) difficulty in finding ILP, (b) TLB misses, and (c) cache misses, CPI greater than 1 is much more likely than CPI less than 1.

Having many cores cannot improve the overall time spent on any one single one-thread process, but may help significantly with throughput of multiple one-thread processes.

SLIDE 11

A process with five threads

(Timeline diagram: between start and finish, the main thread waits while four “worker” threads do the work.)

Each thread has its own PC, its own stack, and its own sets of GPRs and FPRs. If there are at least four cores, all four worker threads can run at the same time. Speedup relative to the one-thread version can be close to 4.0, but . . . watch out for Amdahl’s law.

SLIDE 12

(Same timeline diagram as on the previous slide.)

Important: All five threads share a common virtual address space. The OS kernel maintains a single page table for the process. Shared memory provides an opportunity to the programmer, but also a perhaps surprisingly complex challenge. Other resources, such as open files, are also shared by all threads belonging to a process.

SLIDE 13

SMP: Symmetric MultiProcessor architecture

SMP is currently the dominant architecture for processor circuits in smartphones, laptops, desktops, and small servers. Hennessy and Patterson often use the term centralized shared-memory multiprocessor to mean the same thing as SMP. The key feature of SMP architecture is a single main memory, to which all cores have equally fast access. The Intel Core i7 is a well-known SMP chip . . .

SLIDE 14

Intel Core i7 cache and DRAM arrangement:

(Diagram: cores 0–3 each have private L1 I and L1 D caches and a private unified L2 cache; all four cores share a single L3 cache, which connects through the DRAM controller to a bus leading to off-chip DRAM modules.)

The above diagram shows relationships between caches. See textbook page 29 for physical layout of a Core i7 die.

SLIDE 15

DSM/NUMA multiprocessor architecture

DSM and NUMA are two names for the same thing.

DSM: Distributed shared memory
NUMA: Nonuniform memory access

This kind of architecture has multiple main memories. Processors have relatively fast access to their local main memories, and relatively slow access to other main memories. This kind of architecture works well for larger servers, with too many cores for effective SMP.

For the rest of ENCM 501, we will look at SMP only.

SLIDE 16

“Private” caches are not totally private!

Process P has two threads: T1 and T2. T1 is running in core 0 and T2 is running in core 1. T1 and T2 both frequently read global variable G. (Remember T1 and T2 share a common virtual address space!) How many copies of G are there in the memory system?

(Diagram: cores 0 and 1 each have private L1 I and L1 D caches; both share a unified L2 cache and a DRAM controller connected to off-chip DRAM.)

SLIDE 17

Let’s continue. Suppose that T1 only reads G, but T2 makes frequent reads from and occasional writes to G. Q1: Why is it unacceptable to allow T1 and T2 to proceed with different “versions” of G after T2 writes to G?


Q2a: What would be a perfect, but impossible solution to the problem?
Q2b: A pretty-good, but also impossible solution?
Q3: What are some practical solutions?

SLIDE 18

Cache coherency

Multiprocessor systems require a coherent memory system. The concept of coherence is defined and explained over the next three slides. The material is a rephrasing of the discussion on textbook pages 352–353, in terms of actions within a multicore SMP chip.

SLIDE 19

Here is property 1 of a coherent memory system: A core writes to memory location X. Time passes, during which there are no further writes to X. The same core reads X. The value read must be the same value that was previously written. This just says that stores to and loads from a single location have to be done in program order. If a core does out-of-order execution of instructions, the design of the core must ensure that this property is not violated.

SLIDE 20

Here is property 2 of a coherent memory system: Core A writes to memory location X. Some sufficient amount of time passes, during which there are no more writes to X. Core B reads X. The value read by Core B must be the value written by Core A.

The key word here is sufficient. If the read is too soon after the write, it’s permissible for hardware to allow B to read an out-of-date value from X! It’s up to software to guard against the incorrect read that hardware may offer.

SLIDE 21

Here is property 3 of a coherent memory system: Core A writes to memory location X. Later in time, Core B writes to memory location X. When both writes are complete, and there are no pending further writes to X, all cores that read X must get the value written by Core B. This is called write serialization.

The danger that must be avoided is this: Suppose A and B announce their writes by broadcasting messages on a bus shared by all cores. If the messages arrive out of order at Core C, Core C could maintain a wrong value for X (the value that came from A) even though both writes completed.

SLIDE 22

Upcoming Topics

◮ overview of cache coherency protocols
◮ hardware support for synchronization
◮ introduction to Pthreads programming

Related reading in Hennessy & Patterson: Sections 5.2 and 5.5. Other related reading: Assignment 9 instructions.