
Goals Understand the terms and ideas used in a modern, - PowerPoint PPT Presentation

CS356 Unit 12b: Advanced Processor Organization. Goals: Understand the terms and ideas used in a modern, high-performance processor. Various systems have different kinds of processors, and you should understand the pros and cons of each kind of processor.


  1. 12b.1 CS356 Unit 12b: Advanced Processor Organization

     12b.2 Goals
     • Understand the terms and ideas used in a modern, high-performance processor
     • Various systems have different kinds of processors and you should understand the pros and cons of each kind of processor
     • Terms to listen for and understand the concept:
       – Superscalar/multiple issue, loop unrolling, register renaming, out-of-order execution, speculation, and branch prediction

     12b.3 INSTRUCTION LEVEL PARALLELISM

     12b.4 A New Instruction
     • In x86, we often perform
       – cmp %rax, %rbx
       – je L1 or jne L1
     • Many instruction sets have a single instruction that both compares and jumps (limited to registers only)
       – je %rax, %rbx, L1
       – jne %rax, %rbx, L1
     • Let us assume x86 supports such an instruction in our subsequent discussion

  2. 12b.5 Have We Hit The Limit
     • Under ideal circumstances, a pipeline would allow us to achieve a throughput (IPC = Instructions per clock) of __________
     • Can we do better? Can we execute more than one instruction per clock?
       – Not with a single pipeline
       – But what if we had ____________________
       – What if we fetched multiple __________ per clock and let them run down the pipeline in parallel
     • Let's exploit _______________!

     12b.6 Exploiting Parallelism
     • With the increasing transistor budgets of modern processors (i.e. they can do more things at the same time), the question becomes how do we find enough useful tasks to increase performance or, put another way, what are the most effective ways of exploiting parallelism?
     • Many types of parallelism are available
       – Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
       – Thread Level Parallelism (TLP): Overlap execution of multiple processes / threads
       – Data Level Parallelism (DLP): Overlap an operation (instruction) that is to be applied to multiple data values (usually in an array)
         • for(i=0; i < MAX; i++) { A[i] = A[i] + 5; }
     • We'll focus on ILP in this unit

     12b.7 Basic Blocks
     • Basic Block (def.) = Sequence of instructions that will ________ be executed ___________
       – No conditional branches out
       – No branch targets coming in
       – Also called "straight-line" code
       – Average size: ______ instrucs.
     • Instructions in a basic block can be overlapped if there are no data dependencies
     • ____________ dependences really limit our window of possible instructions to overlap
       – Without extra hardware, we can only overlap execution of instructions within a basic block
     • Example (the annotated region is a basic block: starts w/ _______, ends with _______):
             ld  0(%r8),%r9
             and %r10,%r11
         L1: add %r8,%r12
             or  %r11,%r13
             sub %r14,%r10
             jeq %r12,%r14,L1
             xor %r10,%r15

     12b.8 Instruction Level Parallelism (ILP)
     • Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel.
     • ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel
     • Data flow (data ______________) is what truly _________ ordering
       – We call these dependencies _________________________ Hazards
     • Independent instructions can be __________________
     • Control hazards also provide ordering constraints
     • Example:
             ld  0(%r8), %r9
             and %r10, %r11      write %r11
             or  %r11, %r13      read %r11
             sub %r14, %r15      write %r15
             add %r10, %r12      write %r12
             je  $0,%r12,L1      read %r12
             xor %r15, %rax      read %r15
     • Dependency graph (figure), row by row: LD AND SUB ADD | OR JE | XOR
     • Cycle 1: / / /   Cycle 2: / / /   Cycle 3: / / /

  3. 12b.9 Superscalar
     • When airplanes broke the sound barrier we said they were super-sonic
     • When a processor (HW) can complete ____________ instruction per clock cycle we say it is super-scalar
     • Problem: The HW can execute 2 or more instructions during the same cycle but the SW may be written and compiled assuming 1 instruction executing at a time.
     • Solutions
       – ______________ the code and rely on the ________ to safely order instructions that can be run in parallel (static scheduling)
       – Build the ______ to be smart, _______ instructions on the fly while guaranteeing correctness (dynamic scheduling)

     12b.10 Superscalar (Multiple Issue)
     • Multiple "pipelines" that can fetch, decode, and potentially execute more than 1 instruction per clock
       – k-way superscalar = Ability to complete up to k instructions per clock cycle
     • Benefits
       – Theoretical throughput greater than 1 (IPC > 1)
     • Problems
       – Hazards
         • Dependencies between instructions limiting parallelism
         • Branch/jump requires flushing all pipelines
       – Finding enough parallel instructions

     12b.11 Data Flow and Dependency Graphs
     • The compiler produces a sequential order of instructions
             ld  0(%r8), %r9
             and %r9, %r11
             or  %r11, %r13
             sub %r14, %r15
             add %r10, %r12
             je  $0,%r12,L1
             xor %r15, %r9
     • Modern processors will transform the sequential order to execute instructions in parallel
     • Instructions can be executed in any valid __________________ of the dependency graph
     • Dependency graph (figure), row by row: LD | AND SUB ADD | OR JE | XOR

     12b.12 Compiler-based solutions: STATIC MULTIPLE ISSUE MACHINES

  4. 12b.13 Static Multiple Issue
     • ___________ is responsible for finding and packaging instructions that can execute in parallel into issue packets
       – Only certain combinations of instructions can be in a packet together
       – Instruction packet example:
         • (1) Integer/Branch instruction slot
         • (1) LD/ST instruction
         • (1) FP operation
     • An issue packet is often thought of as a LONG instruction containing multiple instructions (a.k.a. Very Long Instruction Word)
       – Intel's Itanium used this technique (static multiple issue) but called it EPIC (Explicitly Parallel Instruction Computing)

     12b.14 Example 2-way VLIW machine
     • One issue slot for INT/BRANCH operations & another for LD/ST instructions
     • I-Cache reads out an entire issue packet (more than 1 instruction)
     • HW is added to allow many registers to be accessed at one time
       – Just more multiplexers
     • Address Calculation Unit (just a simple adder)
     • Figure (datapath): PC, I-Cache, Reg. File (4 Read, 2 Write), INT/BRANCH ALU (Integer Slot), Addr. Calc. and D-Cache (LD/ST Slot)
     • Example issue packet: add %rcx,%rax (Integer Slot) with ld 8(%rdi),%rdx (LD/ST Slot)
     • Issue Packet = More than 1 instruction

     12b.15 Sample Scheduling
     • Schedule the following loop body on our 2-way static issue machine
             void f1(int* A, int n) {
               for( ; n != 0; n--, A++)
                 *A += 5;
             }

             # %rdi = A
             # %esi = n = # of iterations
         L1: ld  0(%rdi),%r9
             add $5,%r9
             st  %r9,0(%rdi)
             add $4,%rdi
             add $-1,%esi
             jne $0,%esi,L1
     • Schedule w/o modifying original code but with code movement (columns: Int./Branch Slot, LD/ST Slot): IPC = ___ instrucs. / ___ cycles = _____
     • Schedule w/ modifications and code movement (columns: Int./Branch Slot, LD/ST Slot): IPC = ___ instrucs. / ___ cycle = _____

     12b.16 2-way VLIW Scheduling
     • 1.) No forwarding w/in an issue packet (between instructions in a packet)
     • 2.) Full forwarding to previous instructions
       – Those behind in the pipeline
     • 3.) Still 1 stall cycle necessary when LD is followed by a dependent instruction
     • Figure (issue packets over time, Int./Branch Slot and LD/ST Slot), instructions shown: sub %rax,%rbx; add %rcx,%rax; st %rax,0(%rdi); or %rcx,%rdx; ld 0(%rdi),%rcx
