Superscalar Processors
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings

Objective
To provide an overview of the superscalar approach and the key design issues associated with its implementation.

Outline
- Parallelism Concepts
- Superscalar × Superpipelining
- Limitations to Parallelism
- Instruction Issue Policies
- Register Renaming and Dynamic Scheduling

Two Parallelism Concepts
Instruction Level Parallelism (ILP) exists when instructions in a sequence are independent and thus can be executed in parallel, e.g.,
    ...
    ADD EAX,ECX
    MOV EBX,ESI
    ...
can be executed simultaneously while keeping the same result as in a sequential execution.
Machine Parallelism is a measure of the ability of the processor to take advantage of ILP.

Superpipelining Approach
In a conventional pipeline the most time-consuming stage determines the clock rate.
    Stages: Ifetch (t), Decode (t), Execute (2t), Write (t) → clock period = 2t
A superpipelined machine runs at a higher clock rate by splitting the most time-consuming stages into smaller stages.
    More stages, each taking t → clock period = t
Example: Pentium IV (20 stages)
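The effect of splitting the slow stage can be checked with a small calculation. The sketch below is only an idealized model: it assumes no stalls and no latch overhead, takes t = 1, and uses hypothetical helper names (`clock_period`, `split_slow_stages`, `total_time`) that are not part of the slides.

```python
import math

def clock_period(stages):
    """The slowest stage sets the clock period."""
    return max(stages)

def split_slow_stages(stages, target):
    """Split every stage longer than `target` into equal sub-stages."""
    result = []
    for latency in stages:
        pieces = math.ceil(latency / target)
        result.extend([latency / pieces] * pieces)
    return result

def total_time(stages, n_instructions):
    """Ideal pipeline: fill time plus one instruction completed per clock."""
    return (len(stages) + n_instructions - 1) * clock_period(stages)

T = 1  # assumption: t is taken as one time unit
conventional = [T, T, 2 * T, T]            # Ifetch, Decode, Execute, Write
superpipelined = split_slow_stages(conventional, T)

print(clock_period(conventional), clock_period(superpipelined))  # 2 1.0
print(total_time(conventional, 100))    # 206
print(total_time(superpipelined, 100))  # 104.0
```

Under these assumptions the superpipelined version has one more stage but half the clock period, so 100 instructions finish in roughly half the time.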

Superpipelining Execution
[Figure: time diagrams for a conventional pipeline (stages Ifetch, Decode, Execute, Write) and for a superpipeline; time axis from 0 to 9]

Superscalar Approach
A superscalar machine is able to execute multiple instructions independently and concurrently in multiple pipelines.
[Figure: General Superscalar Organization, showing the integer register file, the floating-point register file, pipeline functional units, and memory]

Superscalar Execution
[Figure: time diagrams for a conventional pipeline and for a superscalar pipeline (stages Ifetch, Decode, Execute, Write); time axis from 0 to 9]

Limitations to Parallelism
1. Resource Conflicts: competition of two or more instructions for the same resource at the same time, e.g., consecutive arithmetic instructions → a possible solution is adding a second ALU.
2. Procedural Dependency: conditional branches (already seen); in superscalar processors more is lost if the prediction fails.
3. Data Dependencies

Data Dependencies
1. True Data Dependency, or Read-after-Write (RAW)
    ...
    ADD EAX,ECX
    MOV EBX,EAX
    ...
2. Output Dependency, or Write-after-Write (WAW)
    ...
    ADD EAX,ECX
    MOV EAX,EBX
    ...
3. Antidependency, or Write-after-Read (WAR)
    ...
    ADD EBX,EAX
    MOV EAX,ECX
    ...
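The three cases can be told apart by comparing the registers each instruction reads and writes. The following is a minimal sketch, not a description of any real hardware: the `Instr` type and the `dependencies` function are hypothetical names, and the instructions are written in three-operand form (dest = src op src), as in the program used in the issue-policy examples. It compares one pair of instructions only and ignores anything in between.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    dest: str              # register written
    srcs: Tuple[str, ...]  # registers read

def dependencies(earlier: Instr, later: Instr) -> Set[str]:
    """Classify how `later` depends on `earlier` (pairwise check only)."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if earlier.dest == later.dest:
        found.add("WAW")  # both write the same register
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier reads
    return found

i1 = Instr("R3", ("R0", "R1"))  # R3 = R0 * R1
i2 = Instr("R4", ("R0", "R2"))  # R4 = R0 + R2
i4 = Instr("R6", ("R1", "R4"))  # R6 = R1 + R4
i6 = Instr("R1", ("R0", "R2"))  # R1 = R0 - R2
i8 = Instr("R1", ("R4", "R4"))  # R1 = R4 + R4

print(dependencies(i1, i2))  # set()
print(dependencies(i2, i4))  # {'RAW'}
print(dependencies(i4, i6))  # {'WAR'}
print(dependencies(i6, i8))  # {'WAW'}
```

An empty result means the pair is independent and is a candidate for parallel execution.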

Instruction Issue Policy
It refers to:
1. Instruction fetch: order in which instructions are fetched
2. Instruction execution: order in which instructions are delivered to a functional unit to execute the operation
3. Instruction commit: order in which instruction results are stored in registers and memory

In-order issue with in-order completion
Instruction issuing is stalled by resource conflicts, procedural dependencies, or any data dependencies.
Example:
1. Up to two instructions may be fetched, issued and written back at a time.
2. The fetch of the next two instructions waits until the decode buffer is cleared.
3. 3 functional units: * (2 clocks), / (2 clocks), (+,-) (1 clock).
4. A data dependency stalls instruction issuing until the execution of the earlier instruction is completed.
5. In RAW, the later instruction may be issued only after the earlier instruction has written the result.

In-order issue and completion
Program:
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2
7. R3=R3*R1
8. R1=R4+R4
[Table: instructions occupying the decode, /, *, +/- and write columns in each cycle (CY), cycles 1 to 15]

Out-of-order issue and completion
Instruction window
- A buffer where decoded instructions are stored waiting for execution.
- It decouples the decode stages from the execution stages.
- The processor can continue to fetch and decode until this window is full.
- When a functional unit becomes available, an instruction can be executed.
- Since instructions have been decoded, the processor can look ahead.
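The look-ahead made possible by the window can be sketched as a selection step that runs once per cycle. This is a simplified illustration under stated assumptions, not the slides' exact timing model: `Instr` and `select_for_issue` are hypothetical names, there is one unit of each kind, only true (RAW) dependencies block an instruction (WAW and WAR are assumed to be removed by register renaming), and the `in_order` flag is added only to contrast the two policies.

```python
from typing import List, NamedTuple, Set, Tuple

class Instr(NamedTuple):
    name: str
    unit: str              # "mul", "div" or "add"
    dest: str
    srcs: Tuple[str, ...]

def select_for_issue(window: List[Instr], pending_writes: Set[str],
                     free_units: Set[str], width: int = 2,
                     in_order: bool = False) -> List[str]:
    """Pick the instructions in the window that may issue this cycle."""
    issued = []
    free = set(free_units)
    not_ready = set(pending_writes)   # registers whose value is not yet written
    for instr in window:              # window is ordered oldest first
        can_issue = (len(issued) < width
                     and instr.unit in free
                     and not (set(instr.srcs) & not_ready))
        if can_issue:
            issued.append(instr.name)
            free.discard(instr.unit)
        elif in_order:
            break                     # in-order issue: nothing younger may pass
        not_ready.add(instr.dest)     # younger readers of dest must wait (RAW)
    return issued

window = [
    Instr("I4", "add", "R6", ("R1", "R4")),  # R6 = R1 + R4
    Instr("I5", "mul", "R7", ("R1", "R2")),  # R7 = R1 * R2
    Instr("I6", "add", "R1", ("R0", "R2")),  # R1 = R0 - R2
]
units = {"mul", "div", "add"}

# R4 has not been written yet, so I4 must wait.
print(select_for_issue(window, {"R4"}, units))                 # ['I5', 'I6']
print(select_for_issue(window, {"R4"}, units, in_order=True))  # []
```

With out-of-order issue the processor looks past the stalled I4 and issues I5 and I6; with in-order issue the same stall blocks everything behind it.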

Out-of-order issue and completion
Instruction issuing is stalled by resource conflicts, procedural dependencies, or TRUE data dependencies.
Example:
1. Up to two instructions may be fetched, issued and written back at a time.
2. 3 functional units: * (2 clocks), / (2 clocks), (+,-) (1 clock).
3. A data dependency does not stall the issuing of later, independent instructions.
4. In RAW, the later instruction may be issued only after the earlier instruction has written the result.

Out-of-order issue and completion
Program (register renaming shown on the right):
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2    S1
7. R3=R3*R1    S1
8. R1=R4+R4    S2
[Table: instructions occupying the decode, /, *, +/- and write columns in each cycle (CY), cycles 1 to 15]

Out-of-order issue with in-order completion
Exercise: How would it be if out-of-order issue is allowed but in-order completion is required?
Program:
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2
7. R3=R3*R1
8. R1=R4+R4
[Table: instructions occupying the decode, /, *, +/- and write columns in each cycle (CY), cycles 1 to 15]

Exercises
Exercise 1: Complete the tables on the right under the same assumptions as in the previous examples for the program fragment below and for in-order issue and completion.
1. R3=R0*R1
2. R4=R0*R2
3. R5=R0/R1
4. R6=R5+R4
5. R5=R1-R2
6. R1=R0-R2
7. R3=R3*R1
[Table to complete: columns decode, /, *, +/-, write; cycles (CY) 1 to 15]

Exercises
Exercise 2: Complete the tables on the right under the same assumptions as in the previous examples for the program fragment below and for out-of-order issue and completion.
1. R3=R0*R1
2. R4=R0*R2
3. R5=R0/R1
4. R6=R5+R4
5. R5=R1-R2
6. R1=R0-R2
7. R3=R3*R1
[Table to complete: columns decode, /, *, +/-, write; cycles (CY) 1 to 15]

Exercises
Exercise 3: Complete the tables on the right under the same assumptions as in the previous examples for the program fragment below and for in-order issue and completion.
1. R3=R0-R1
2. R4=R0+R3
3. R3=R0/R1
4. R6=R5*R4
5. R5=R1-R2
6. R1=R0*R2
7. R3=R3*R5
[Table to complete: columns decode, /, *, +/-, write; cycles (CY) 1 to 15]

Exercises
Exercise 4: Complete the tables on the right under the same assumptions as in the previous examples for the program fragment below and for out-of-order issue and completion.
1. R3=R0-R1
2. R4=R0+R3
3. R3=R0/R1
4. R6=R5*R4
5. R5=R1-R2
6. R1=R0*R2
7. R3=R3*R5
[Table to complete: columns decode, /, *, +/-, write; cycles (CY) 1 to 15]

Register Renaming
Logical registers contain pointers to hidden registers, which actually contain the data.
[Figure: logical registers R0 to R3, holding the pointers 3, 4, 7 and 5, next to hidden registers S0 to S7, which contain the data]
The hardware keeps track of non-committed hidden registers.
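The pointer idea can be sketched as a mapping table that is updated on every write. The code below is an illustration under assumptions that are not in the slides: the function name `rename` is hypothetical, logical register Ri initially points to hidden register Si, and every write takes a fresh hidden register that is never recycled (real hardware frees hidden registers once results are committed).

```python
from itertools import count

def rename(program, num_logical=8):
    """Rewrite (dest, op, src_a, src_b) tuples in terms of hidden registers."""
    # Assumption: logical register Ri initially points to hidden register Si.
    mapping = {f"R{i}": f"S{i}" for i in range(num_logical)}
    fresh = count(num_logical)          # next unused hidden register number
    renamed = []
    for dest, op, a, b in program:
        src_a, src_b = mapping[a], mapping[b]   # read the current pointers first
        mapping[dest] = f"S{next(fresh)}"       # each write gets a new register
        renamed.append((mapping[dest], op, src_a, src_b))
    return renamed

program = [
    ("R3", "*", "R0", "R1"),
    ("R4", "+", "R0", "R2"),
    ("R5", "/", "R0", "R1"),
    ("R6", "+", "R1", "R4"),
    ("R7", "*", "R1", "R2"),
    ("R1", "-", "R0", "R2"),
    ("R3", "*", "R3", "R1"),
    ("R1", "+", "R4", "R4"),
]

for dest, op, a, b in rename(program):
    print(f"{dest} = {a} {op} {b}")
```

In the renamed program every destination is distinct, so the output dependencies and antidependencies disappear; only the true dependencies remain, for example instruction 7 becomes S14 = S8 * S13 and still waits for instructions 1 and 6.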
