CS422 Computer Architecture, Spring 2004: Lecture 15, 20 Feb 2004

  1. CS422 Computer Architecture, Spring 2004
     Lecture 15, 20 Feb 2004
     Bhaskaran Raman, Department of CSE, IIT Kanpur
     http://web.cse.iitk.ac.in/~cs422/index.html

  2. Further Topics in ILP
     ● Multiple issue
     ● Software support
     ● Hardware support

  3. Increasing ILP through Multiple Issue
     ● With at most one issue per cycle, the minimum achievable CPI is 1
       – But there are multiple functional units
       – Hence, use multiple issue
     ● Two ways to do multiple issue:
       – Superscalar processor
         ● Issues a varying number of instructions per cycle
         ● Static or dynamic scheduling
       – Very Long Instruction Word (VLIW)
         ● Issues a fixed number of instructions per cycle

  4. Superscalar DLX
     ● Simple version: two instructions issued per cycle
       – One integer (load, store, branch, integer ALU) and one FP
       – Instructions paired and aligned on 64-bit boundaries
       – Integer instruction first, FP instruction next

     Instruction   CC1   CC2   CC3   CC4   CC5   CC6
     Integer       IF    ID    EX    MEM   WB
     FP            IF    ID    EX    MEM   WB
     Integer             IF    ID    EX    MEM   WB
     FP                  IF    ID    EX    MEM   WB

  5. Superscalar DLX (continued)
     ● No conflicts, almost...
       – Assuming separate register sets, only FP load, store, and move cause problems
         ● Structural hazard on the register port
         ● New RAW hazard between a pair of instructions
       – Structural hazard:
         ● Detect it, and do not issue the FP operation
         ● Or, provide additional register ports
       – RAW hazard:
         ● Detect it, and do not issue the FP operation (see the issue-check sketch below)
         ● Also, the result of a load cannot be used for the next 3 instructions
         ● And, branches now have 3 delay slots
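As a concrete (and unofficial) illustration of the checks above, here is a minimal sketch in C of the dual-issue decision for the simple two-issue DLX, assuming an already-decoded instruction pair; the Instr record and its field names are hypothetical, not part of the lecture.

```c
#include <stdbool.h>

/* Hypothetical decoded-instruction record for the two-issue DLX sketch. */
typedef struct {
    bool is_fp_load_store_move;  /* FP load, store, or move placed in the integer slot */
    bool writes_fp_reg;          /* true for an FP load or a move into an FP register */
    int  dest_reg;               /* destination register number, -1 if none */
    int  src_reg1, src_reg2;     /* source register numbers, -1 if unused */
} Instr;

/* Can the FP instruction issue in the same cycle as the integer instruction?
 * This mirrors the two checks on the slide: a structural hazard on the shared
 * FP register port, and a new RAW hazard inside the pair. */
bool can_dual_issue(const Instr *intop, const Instr *fpop, bool extra_fp_reg_ports)
{
    /* Structural hazard: an FP load/store/move in the integer slot competes with
     * the FP operation for the FP register file ports.  The slide's two options:
     * stall the FP op, or provide additional ports (extra_fp_reg_ports). */
    if (intop->is_fp_load_store_move && !extra_fp_reg_ports)
        return false;

    /* RAW hazard inside the pair: the FP op reads an FP register that the
     * integer-slot instruction (an FP load or move) is about to write. */
    if (intop->writes_fp_reg &&
        (fpop->src_reg1 == intop->dest_reg || fpop->src_reg2 == intop->dest_reg))
        return false;

    return true;
}
```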

  6. Static Scheduling in the Superscalar DLX: An Example

     Original loop:
     Loop: LD   F0, 0(R1)     // F0 is the array element
           ADDD F4, F0, F2    // F2 holds the scalar 'C'
           SD   0(R1), F4     // Store the result
           SUBI R1, R1, 8     // For the next iteration
           BNEZ R1, Loop      // More iterations?

     Unrolled five times and scheduled for dual issue (integer slot | FP slot):
     Loop: LD   F0, 0(R1)
           LD   F6, -8(R1)
           LD   F10, -16(R1)   ADDD F4, F0, F2
           LD   F14, -24(R1)   ADDD F8, F6, F2
           LD   F18, -32(R1)   ADDD F12, F10, F2
           SD   0(R1), F4      ADDD F16, F14, F2
           SD   -8(R1), F8     ADDD F20, F18, F2
           SD   -16(R1), F12
           SUBI R1, R1, #40
           SD   16(R1), F16    // offset adjusted: R1 has already been decremented by 40
           BNEZ R1, Loop
           SD   8(R1), F20     // final store, placed in the branch delay slot
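For reference, a C-level view (not on the slide) of what the DLX loop computes and what unrolling it five times looks like before scheduling; the names x, c, and n are assumptions made for this sketch, and n is assumed to be a multiple of 5.

```c
/* The original loop: add a scalar to each element x[1..n], walking backwards,
 * just as the DLX code walks R1 down through the array. */
void add_scalar(double *x, double c, long n)
{
    for (long i = n; i > 0; i--)
        x[i] = x[i] + c;
}

/* Unrolled five times (assuming n is a multiple of 5): this exposes five
 * independent load/add/store chains, which the superscalar schedule above
 * interleaves across the integer and FP issue slots. */
void add_scalar_unrolled(double *x, double c, long n)
{
    for (long i = n; i > 0; i -= 5) {
        x[i]     = x[i]     + c;
        x[i - 1] = x[i - 1] + c;
        x[i - 2] = x[i - 2] + c;
        x[i - 3] = x[i - 3] + c;
        x[i - 4] = x[i - 4] + c;
    }
}
```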

  7. Dynamic Scheduling in the Superscalar DLX
     ● Scoreboarding or Tomasulo's algorithm can be applied
     ● Should preserve in-order issue!
       – Use separate data structures for Int and FP
     ● When the instruction pair has a dependence:
       – We wish to issue both in the same cycle
       – Two approaches:
         ● Pipeline the issue stage, so that it runs twice as fast
         ● Exclude load/store buffers from the set of reservation stations (RSs)

  8. Multiple Issue using VLIW
     ● Superscalar ==> too much hardware
       – For hazard detection and scheduling
     ● Alternative: let the compiler do all the scheduling
       – VLIW (Very Long Instruction Word)
       – E.g., a VLIW instruction may include 2 integer, 2 FP, and 2 memory operations, plus a branch (see the sketch below)
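A rough sketch (not from the lecture) of what one such VLIW instruction word could look like in C, with the slot mix mentioned above: 2 integer, 2 FP, 2 memory operations, and 1 branch; the Op encoding is hypothetical.

```c
#include <stdint.h>

/* One operation slot.  A real machine would use fixed-width bit fields,
 * but an opcode plus register/immediate operands is enough for a sketch. */
typedef struct {
    uint16_t opcode;            /* 0 means the slot is empty (a no-op) */
    uint8_t  dest, src1, src2;  /* register operands */
    int32_t  imm;               /* immediate / displacement */
} Op;

/* A single very long instruction word: the compiler fills the slots with
 * independent operations, and the hardware issues all of them in the same
 * cycle with no run-time dependence checking between slots. */
typedef struct {
    Op integer_ops[2];   /* 2 integer ALU operations */
    Op fp_ops[2];        /* 2 floating-point operations */
    Op mem_ops[2];       /* 2 loads/stores */
    Op branch;           /* 1 branch */
} VLIWWord;
```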

  9. Limitations to Multiple Issue
     ● Why not 10 issues per cycle? Why not 20?
     ● Three limitations:
       – Inherent ILP limitations in programs
       – Hardware costs (even for VLIW)
         ● Memory/register bandwidth
       – Implementation issues:
         ● Superscalar: complexity of the hardware logic
         ● VLIW: increased code size, binary compatibility problems

  10. Support for ILP
      ● Software (compiler) support
      ● Hardware support
      ● Combination of both

  11. Compiler Support for ILP
      ● Loop unrolling:
        – Dependence analysis is a major component
        – Analysis is simple when array indices are linear in the loop variable (called affine indices); a sketch of one such test follows below
      ● Limitations to dependence analysis:
        – Pointers
        – Indirect indexing
        – Analysis has to consider corner cases too
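To make this concrete (this is not from the lecture): an access such as a[2*i + 3] is affine in the loop variable i, while a[b[i]] is indirect and defeats the analysis. Below is a minimal sketch of one standard check for affine indices, the GCD test; the function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of two non-negative integers. */
static long gcd(long a, long b)
{
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for two affine references a[c1*i + k1] (written) and
 * a[c2*i + k2] (read) in the same loop.  If gcd(c1, c2) does not divide
 * k2 - k1, the two index expressions can never be equal, so there is no
 * dependence; otherwise a dependence may exist and the compiler must be
 * conservative (or apply a more precise test that also checks bounds). */
bool may_depend(long c1, long k1, long c2, long k2)
{
    long g = gcd(labs(c1), labs(c2));
    if (g == 0)                 /* both coefficients zero: just compare the constants */
        return k1 == k2;
    return (k2 - k1) % g == 0;
}
```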

  12. Compiler Support for ILP (continued)
      ● Two important techniques:
        – Software pipelining
        – Trace scheduling
      ● Software pipelining: reorganize a loop such that each iteration is made from instructions chosen from different iterations of the original loop

  13. Software Pipelining
      [Figure: instructions drawn from iterations 0 through 4 of the original loop are combined to form one software-pipelined iteration]

  14. Software Pipelining in Our Example

      Original loop:
      Loop: LD   F0, 0(R1)     // F0 is the array element
            ADDD F4, F0, F2    // F2 holds the scalar 'C'
            SD   0(R1), F4     // Store the result
            SUBI R1, R1, 8     // For the next iteration
            BNEZ R1, Loop      // More iterations?

      Bodies of three consecutive iterations:
      Iter i:   LD   F0, 0(R1)
                ADDD F4, F0, F2
                SD   0(R1), F4
      Iter i+1: LD   F0, 0(R1)
                ADDD F4, F0, F2
                SD   0(R1), F4
      Iter i+2: LD   F0, 0(R1)
                ADDD F4, F0, F2
                SD   0(R1), F4

      Software-pipelined loop (store from iteration i, add from iteration i+1, load from iteration i+2):
      Loop: SD   16(R1), F4    // store the result computed for iteration i
            ADDD F4, F0, F2    // add for iteration i+1
            LD   F0, 0(R1)     // load for iteration i+2
            SUBI R1, R1, 8
            BNEZ R1, Loop
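A C-level view (not from the slides) of the same transformation, assuming the array-plus-scalar loop used earlier and n >= 3: each steady-state pass stores the result of the current iteration, does the add for the next iteration, and loads for the one after that, so the three operations in one pass are independent of each other.

```c
/* Software-pipelined form of "for (i = n; i > 0; i--) x[i] += c;".
 * Start-up (prologue) and wind-down (epilogue) code handles the
 * iterations that are only partially in flight. */
void add_scalar_swp(double *x, double c, long n)
{
    double loaded = x[n];            /* prologue: load for iteration n     */
    double summed = loaded + c;      /* prologue: add for iteration n      */
    loaded = x[n - 1];               /* prologue: load for iteration n - 1 */

    for (long i = n; i > 2; i--) {
        x[i] = summed;               /* store: result of iteration i        */
        summed = loaded + c;         /* add:   value loaded for i - 1       */
        loaded = x[i - 2];           /* load:  element for iteration i - 2  */
    }

    /* Epilogue: finish the last two iterations already in flight. */
    x[2] = summed;
    x[1] = loaded + c;
}
```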

  15. Trace Scheduling
      ● Compiler picks the program trace it considers most likely
        – Schedule instructions from the trace
        – And branches into and out of the trace
        – Also need bookkeeping instructions in case the trace is not taken during execution
      [Figure: flow graph of the example: A[i] = A[i] + B[i]; then the test A[i] = 0? with T and F outcomes, one side assigning B[i] = ... and the other X = ...; both paths rejoin at C[i] = ...]
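A C-level rendering (not from the slides) of the flow graph in the figure, to make the trace concrete; the right-hand sides are placeholders for the "..." in the figure and the parameter names are assumed. If the compiler believes one side of the branch is taken almost always, it schedules the assignment to A[i], that side's assignment, and the assignment to C[i] as one straight-line trace, and inserts compensation (bookkeeping) code on the other, off-trace path.

```c
/* The code shape behind the trace-scheduling figure: one frequent path
 * through the branch is treated as straight-line code and scheduled
 * aggressively; the rare path gets bookkeeping code to undo or redo
 * anything the scheduler moved across the branch. */
void trace_example(double *A, double *B, double *C, double *X, long i)
{
    A[i] = A[i] + B[i];

    if (A[i] == 0.0)        /* the test "A[i] = 0?" from the figure */
        B[i] = 1.0;         /* one branch target; value is a placeholder */
    else
        *X = 2.0;           /* the other branch target; value is a placeholder */

    C[i] = A[i] * 2.0;      /* join point; part of the chosen trace */
}
```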
