Slides for Lecture 21 ENCM 501: Principles of Computer Architecture


SLIDE 1

Slides for Lecture 21

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

27 March, 2014

SLIDE 2

ENCM 501 W14 Slides for Lecture 21

slide 2/22

Previous Lecture

◮ more examples of Tomasulo’s algorithm
◮ reorder buffers and speculation
◮ introduction to multiple issue of instructions

Related reading in Hennessy & Patterson: Sections 3.5–3.8

SLIDE 3

Today’s Lecture

◮ overview of multiple issue of instructions
◮ quick note about limitations of ILP
◮ processes and threads
◮ introduction to SMP architecture
◮ introduction to memory coherency

Related reading in Hennessy & Patterson: Sections 3.7, 3.8, 3.10, 5.1, 5.2

SLIDE 4

Multiple issue

Multiple issue means issue of 2 or more instructions within a single processor core, in a single clock cycle. This is superscalar execution. Many quite different processor organization schemes support multiple issue. Examples:

◮ In-order processing of instructions in two parallel pipelines—e.g., ARM Cortex-A8 described in textbook Section 3.13. Obviously no more than two instructions issue per clock.

◮ In-order issue, out-of-order execution, in-order commit of up to six instructions per clock—current x86-64 microarchitectures, e.g., Intel Core i7, also described in textbook Section 3.13.

SLIDE 5

Multiple issue: Instruction unit requirements

The instruction unit needs to be able to fetch multiple instructions per clock. This has obvious implications for design of L1 I-caches!

The instruction unit has to look at pending instructions, and decide every clock cycle how many instructions are safe to issue in parallel. Let’s look at an easy example for MIPS32, in a microarchitecture that can issue a maximum of two instructions per clock:

LW   R8, 24(R9)
ADDU R10, R11, R4

If the above instructions are the next two in line in program order, can they be issued in parallel? Why or why not?
SLIDE 6

Here is another easy example. The next three instructions in line, in program order, are:

LW   R8, 24(R9)
ADDU R16, R16, R8
SLT  R9, R8, R4

How will these instructions be managed?

In general, the problem of deciding how many instructions to issue in parallel is difficult. See textbook Section 3.8 for further detail. (My favourite sentence in that section is, “Analyze all the dependencies among the instructions in the issue bundle.” That is only one of several things that have to happen every clock cycle!)

SLIDE 7

Limitations of ILP

A shortage of time in ENCM 501 requires brutal oversimplification, but . . .

Roughly speaking, computer architects have discovered that no matter how many transistors are thrown at the problem, it’s very hard to get more than an average of about 2.5 instructions completed per cycle, due mostly to dependencies between instructions. See textbook Figure 3.43 for real data—for 19 different SPECCPU2006 benchmark programs run on a Core i7, the best CPI is 0.44 (2.27 instructions per clock), and some programs have CPIs greater than 1.0!

SLIDE 8

So why try to support issue of 6 instructions/cycle?

A few slides back: “In-order issue, out-of-order execution, in-order commit of up to six instructions per clock—current x86-64 microarchitectures, e.g., Intel Core i7.” It might pay to run two or more threads at the same time in the same core! That might get the average instruction issue rate close to 6 per cycle. Intel’s term for this idea is hyper-threading. (As of March 26, 2014, the Wikipedia article titled Hyper-threading provides a good introduction.) A more generic term for the same idea is simultaneous multithreading (SMT)—see textbook Section 3.12 for discussion.

SLIDE 9

Processes and Threads

Suppose that you build an executable from a bunch of C source files. When you run the executable, it’s almost certain that the program will run in a single thread, unless your source files have explicitly asked in some way for multiple threads. (I use the word almost in case I am wrong about the state of the art for C compilers and program linkers.) All the C code used in ENCM 501 up to Assignment 7 is single-threaded. (And Assignment 8 has no C code at all.)

SLIDE 10

A one-thread process

(Timeline diagram: a single thread does all the work between start and finish.)

Program results are generated as if instructions are processed in program order, as if each instruction finishes as the next instruction starts. If a processor is able to find significant ILP in the program, in the best case, CPI significantly less than 1 can be achieved. In practice, due to (a) difficulty in finding ILP, (b) TLB misses, and (c) cache misses, CPI greater than 1 is much more likely than CPI less than 1.

Having many cores cannot improve the overall time spent on any one single one-thread process, but may help significantly with throughput of multiple one-thread processes.

SLIDE 11

A process with five threads

(Timeline diagram: between start and finish, the main thread waits while four “worker” threads do the work.)

Each thread has its own PC, its own stack, and its own sets of GPRs and FPRs. If there are at least four cores, all four worker threads can run at the same time. Speedup relative to the one-thread version can be close to 4.0, but . . . watch out for Amdahl’s law.

SLIDE 12

(Same timeline diagram as on the previous slide.)

Important: All five threads share a common virtual address space. The OS kernel maintains a single page table for the process. Shared memory provides an opportunity to the programmer, but also a perhaps surprisingly complex challenge. Other resources, such as open files, are also shared by all threads belonging to a process.

SLIDE 13

SMP: Symmetric MultiProcessor architecture

SMP is currently the dominant architecture for processor circuits in smartphones, laptops, desktops, and small servers. Hennessy and Patterson often use the term centralized shared-memory multiprocessor to mean the same thing as SMP. The key feature of SMP architecture is a single main memory, to which all cores have equally fast access. The Intel Core i7 is a well-known SMP chip . . .

SLIDE 14

Intel Core i7 cache and DRAM arrangement:

(Diagram: cores 0–3 each have private L1 I and L1 D caches and a private unified L2 cache; all four cores share a single L3 cache, which connects through the DRAM controller to a bus leading to off-chip DRAM modules.)

The above diagram shows relationships between caches. See textbook page 29 for physical layout of a Core i7 die.

SLIDE 15

DSM/NUMA multiprocessor architecture

DSM and NUMA are two names for the same thing.

DSM: Distributed shared memory
NUMA: Nonuniform memory access

This kind of architecture has multiple main memories. Processors have relatively fast access to their local main memories, and relatively slow access to other main memories. This kind of architecture works well for larger servers, with too many cores for effective SMP.

For the rest of ENCM 501, we will look at SMP only.

SLIDE 16

“Private” caches are not totally private!

Process P has two threads: T1 and T2. T1 is running in core 0 and T2 is running in core 1. T1 and T2 both frequently read global variable G. (Remember T1 and T2 share a common virtual address space!) How many copies of G are there in the memory system?

(Diagram: cores 0 and 1 each have private L1 I and L1 D caches; both share a unified L2 cache and a DRAM controller connected to off-chip DRAM.)

SLIDE 17

Let’s continue. Suppose that T1 only reads G, but T2 makes frequent reads from and occasional writes to G. Q1: Why is it unacceptable to allow T1 and T2 to proceed with different “versions” of G after T2 writes to G?


Q2a: What would be a perfect, but impossible solution to the problem?
Q2b: A pretty-good, but also impossible solution?
Q3: What are some practical solutions?

SLIDE 18

Cache coherency

Multiprocessor systems require a coherent memory system. The concept of coherence is defined and explained over the next three slides. The material is a rephrasing of the discussion on textbook pages 352–353, in terms of actions within a multicore SMP chip.

SLIDE 19

Here is property 1 of a coherent memory system: A core writes to memory location X. Time passes, during which there are no further writes to X. The same core reads X. The value read must be the same value that was previously written. This just says that stores to and loads from a single location have to be done in program order. If a core does out-of-order execution of instructions, the design of the core must ensure that this property is not violated.

SLIDE 20

Here is property 2 of a coherent memory system: Core A writes to memory location X. Some sufficient amount of time passes, during which there are no more writes to X. Core B reads X. The value read by Core B must be the value written by Core A.

The key word here is sufficient. If the read is too soon after the write, it’s permissible for hardware to allow B to read an out-of-date value from X! It’s up to software to guard against the incorrect read that hardware may offer.

SLIDE 21

Here is property 3 of a coherent memory system: Core A writes to memory location X. Later in time, Core B writes to memory location X. When both writes are complete, and there are no pending further writes to X, all cores that read X must get the value written by Core B. This is called write serialization.

The danger that must be avoided is this: Suppose A and B announce their writes by broadcasting messages on a bus shared by all cores. If the messages arrive out of order at Core C, Core C could maintain a wrong value for X (the value that came from A) even though both writes completed.

SLIDE 22

Upcoming Topics

◮ overview of cache coherency protocols
◮ hardware support for synchronization
◮ introduction to Pthreads programming

Related reading in Hennessy & Patterson: Sections 5.2 and 5.5. Other related reading: Assignment 9 instructions.