What Limits Performance? Stalls (Data Hazards)

CS553 Lecture Instruction Scheduling

Instruction Scheduling
Last time
– Register allocation
Today
– Instruction scheduling
– The problem: Pipelined computer architecture
– A solution: List scheduling

Background: Pipelining Basics
Idea
– Begin executing an instruction before completing the previous one
[Figure: Instr 0 to Instr 4 plotted against time, without pipelining and with pipelining]

Idealized Instruction Data-Path
Instructions go through several stages of execution
    Stage 1: Instruction Fetch (IF)
    Stage 2: Instruction Decode & Register Fetch (ID/RF)
    Stage 3: Execute (EX)
    Stage 4: Memory Access (MEM)
    Stage 5: Register Write-back (WB)
    IF ⇒ ID/RF ⇒ EX ⇒ MEM ⇒ WB
[Figure: six instructions, each passing through IF ID EX MM WB, plotted against time]

Pipelining Details
Observations
– Individual instructions are no faster (but throughput is higher)
– Potential speedup determined by number of stages (more or less)
– Filling and draining pipe limits speedup
– Rate through pipe is limited by slowest stage
– Less work per stage implies faster clock
Modern Processors
– Long pipelines: 5 (Pentium), 14 (Pentium Pro), 22 (Pentium 4)
– Issue 2 (Pentium), 4 (UltraSPARC) or more (dead Compaq EV8) instructions per cycle
– Dynamically schedule instructions (from limited instruction window) or statically schedule (e.g., IA-64)
– Speculate
    – Outcome of branches
    – Value of loads (research)
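The observations about speedup can be made concrete with a little arithmetic. The sketch below is an idealized model and not part of the slides: it assumes every stage takes exactly one cycle, one instruction issues per cycle, and there are no stalls. The function names are hypothetical.

```python
def unpipelined_cycles(n_instr: int, n_stages: int) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr: int, n_stages: int) -> int:
    """The first instruction fills the pipe (n_stages cycles);
    every later instruction completes one cycle after its predecessor."""
    if n_instr < 1:
        raise ValueError("need at least one instruction")
    return n_stages + (n_instr - 1)


def speedup(n_instr: int, n_stages: int) -> float:
    """Ideal speedup of the pipelined over the unpipelined machine."""
    return unpipelined_cycles(n_instr, n_stages) / pipelined_cycles(n_instr, n_stages)


# Five instructions on a five-stage pipe: filling and draining dominate.
print(f"{speedup(5, 5):.2f}")     # 2.78
# A long instruction stream approaches, but never reaches, the stage count.
print(f"{speedup(1000, 5):.2f}")  # 4.98
```

Under these assumptions the speedup is n·k / (k + n − 1) for n instructions and k stages, which is why the stage count bounds the speedup "more or less" and why short instruction sequences fall well below it.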

What Limits Performance?
Data hazards
– Instruction depends on result of prior instruction that is still in the pipe
Structural hazards
– Hardware cannot support certain instruction sequences because of limited hardware resources
Control hazards
– Control flow depends on the result of branch instruction that is still in the pipe
An obvious solution
– Stall (insert bubbles into pipeline)

Stalls (Data Hazards)
Code
    add $r1,$r2,$r3    // $r1 is the destination
    mul $r4,$r1,$r1    // $r4 is the destination
Pipeline picture
[Figure: the two instructions, each passing through IF ID EX MM WB, plotted against time]

Stalls (Structural Hazards)
Code
    // Suppose multiplies take two cycles
    mul $r1,$r2,$r3
    mul $r4,$r5,$r6
Pipeline picture
[Figure: the two instructions, each passing through IF ID EX EX MM WB, plotted against time]

Stalls (Control Hazards)
Code
    // if $r1==0, branch to label
    bz $r1, label
    add $r2,$r3,$r4
Pipeline picture
[Figure: the two instructions, each passing through IF ID EX MM WB, plotted against time]
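As a rough illustration of the data-hazard case, the sketch below scans a straight-line instruction sequence for a consumer that reads a register written by a nearby producer. The `Instr` tuple, the `raw_hazards` function, and the `window` parameter (how many following instructions still find the producer "in the pipe") are all assumptions made for this example; the slides do not specify how many cycles a result stays unavailable.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dst: str                # destination register (written first, as on the slide)
    srcs: Tuple[str, ...]   # source registers


def raw_hazards(code: List[Instr], window: int = 1) -> List[Tuple[int, int, str]]:
    """Return (producer_index, consumer_index, register) triples where the
    consumer reads a register the producer wrote at most `window`
    instructions earlier."""
    found = []
    for i, producer in enumerate(code):
        for j in range(i + 1, min(i + 1 + window, len(code))):
            consumer = code[j]
            if producer.dst in consumer.srcs:
                found.append((i, j, producer.dst))
            if consumer.dst == producer.dst:
                break  # register overwritten; later reads depend on the new writer
    return found


code = [
    Instr("add", "$r1", ("$r2", "$r3")),
    Instr("mul", "$r4", ("$r1", "$r1")),
]
print(raw_hazards(code))  # [(0, 1, '$r1')]
```

The `mul` reads `$r1` immediately after `add` writes it, which is the dependence the stall on the slide exists to respect.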

Hardware Solutions
Data hazards
– Data forwarding (doesn’t completely solve problem)
– Runtime speculation (doesn’t always work)
Structural hazards
– Hardware replication (expensive)
– More pipelining (doesn’t always work)
Control hazards
– Runtime speculation (branch prediction)
Dynamic scheduling
– Can address all of these issues
– Very successful

Instruction Scheduling for Pipelined Architectures
Goal
– An efficient algorithm for reordering instructions to minimize pipeline stalls
Constraints
– Data dependences (for correctness)
– Hazards (can only have performance implications)
Possible Simplifications
– Do scheduling after instruction selection and register allocation
– Only consider data hazards

Data Dependences
Data dependence
– A data dependence is an ordering constraint on 2 statements
– When reordering statements, all data dependences must be observed to preserve program correctness
True (or flow) dependences
– Write to variable x followed by a read of x (read after write or RAW)
    x = 5;
    print (x);
Anti-dependences
– Read of variable x followed by a write (WAR)
    print (x);
    x = 5;
Output dependences
– Write to variable x followed by another write to x (WAW)
    x = 6;
    x = 5;
Anti-dependences and output dependences are false dependences.

Register Renaming
Idea
– Reduce false data dependences by reducing register reuse
– Give the instruction scheduler greater freedom
Example
Before renaming
    add $r1, $r2, 1
    st  $r1, [$fp+52]
    mul $r1, $r3, 2
    st  $r1, [$fp+40]
After renaming
    add $r1, $r2, 1
    st  $r1, [$fp+52]
    mul $r11, $r3, 2
    st  $r11, [$fp+40]
After renaming and reordering
    add $r1, $r2, 1
    mul $r11, $r3, 2
    st  $r1, [$fp+52]
    st  $r11, [$fp+40]
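The three kinds of dependence can be told apart mechanically from the sets of locations each statement reads and writes. The following is a minimal sketch under that assumption; the `classify` function and its set-based interface are hypothetical and not taken from the slides.

```python
def classify(first_reads, first_writes, second_reads, second_writes):
    """Return the kinds of dependence from an earlier statement to a later one.

    RAW (true/flow): the first writes a location the second reads.
    WAR (anti):      the first reads a location the second writes.
    WAW (output):    both write the same location.
    """
    kinds = set()
    if set(first_writes) & set(second_reads):
        kinds.add("RAW")
    if set(first_reads) & set(second_writes):
        kinds.add("WAR")
    if set(first_writes) & set(second_writes):
        kinds.add("WAW")
    return kinds


# x = 5; print(x);
print(classify([], ["x"], ["x"], []))   # {'RAW'}
# print(x); x = 5;
print(classify(["x"], [], [], ["x"]))   # {'WAR'}
# x = 6; x = 5;
print(classify([], ["x"], [], ["x"]))   # {'WAW'}

# Register renaming example: st $r1,[$fp+52] followed by mul $r1,$r3,2
before = classify(["$r1", "$fp"], [], ["$r3"], ["$r1"])
# ... and the same pair after renaming the mul's destination to $r11
after = classify(["$r1", "$fp"], [], ["$r3"], ["$r11"])
print(before, after)                    # {'WAR'} set()
```

The last two calls show the effect of renaming on the slide's example: the anti-dependence between the first store and the `mul` disappears once the `mul` writes `$r11`, so the scheduler is free to move the `mul` above the store.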

Phase Ordering Problem
Register allocation
– Tries to reuse registers
– Artificially constrains instruction schedule
Just schedule instructions first?
– Scheduling can dramatically increase register pressure
Classic phase ordering problem
– Tradeoff between memory and parallelism
Approaches
– Consider allocation & scheduling together
– Run allocation & scheduling multiple times (schedule, allocate, schedule)

List Scheduling [Gibbons & Muchnick ’86]
Scope
– Basic blocks
Assumptions
– Pipeline interlocks are provided (i.e., algorithm need not introduce no-ops)
– Pointers can refer to any memory address (i.e., no alias analysis)
– Hazards take a single cycle (stall); here let’s assume there are two...
    – Load immediately followed by ALU op produces interlock
    – Store immediately followed by load produces interlock
Main data structure: dependence DAG
– Nodes represent instructions
– Edges (s1,s2) represent dependences between instructions
    – Instruction s1 must execute before s2
– Sometimes called data dependence graph or data-flow graph

Dependence Graph Example
Sample code (operand order: dst src src)
    1  addi $r2,1,$r1
    2  addi $sp,12,$sp
    3  st   a, $r0
    4  ld   $r3,-4($sp)
    5  ld   $r4,-8($sp)
    6  addi $sp,8,$sp
    7  st   0($sp),$r2
    8  ld   $r5,a
    9  addi $r4,1,$r4
Dependence graph
[Figure: dependence graph over instructions 1 to 9 with delay labels of 1 or 2 on the edges; the edge layout is not recoverable]
Hazards in current schedule
    (3,4), (5,6), (7,8), (8,9)
Any topological sort is okay, but we want the best one

Scheduling Heuristics
Goal
– Avoid stalls
Consider these questions
– Does an instruction interlock with any immediate successors in the dependence graph? IOW is the delay greater than 1?
– How many immediate successors does an instruction have?
– Is an instruction on the critical path?
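The two interlock rules assumed for list scheduling are enough to recompute the hazard list for the sample code. The sketch below is an illustration only: it represents each instruction by its number and opcode, and it assumes that `addi` is the only ALU operation in the block. The names `interlocks` and `adjacent_hazards` are hypothetical.

```python
ALU_OPS = {"addi"}  # assumption: the only ALU opcode in the sample code


def interlocks(first_op: str, second_op: str) -> bool:
    """True when issuing second_op right after first_op stalls, per the two rules:
    load followed by ALU op, or store followed by load."""
    return (first_op == "ld" and second_op in ALU_OPS) or \
           (first_op == "st" and second_op == "ld")


def adjacent_hazards(schedule):
    """schedule: list of (instruction_number, opcode) in issue order.
    Returns the pairs of adjacent instruction numbers that interlock."""
    return [(a_num, b_num)
            for (a_num, a_op), (b_num, b_op) in zip(schedule, schedule[1:])
            if interlocks(a_op, b_op)]


original = [(1, "addi"), (2, "addi"), (3, "st"), (4, "ld"), (5, "ld"),
            (6, "addi"), (7, "st"), (8, "ld"), (9, "addi")]
print(adjacent_hazards(original))  # [(3, 4), (5, 6), (7, 8), (8, 9)]
```

The result agrees with the hazards listed for the current schedule. Note that under these rules a pair such as (8,9) counts as a hazard purely because a load is followed by an ALU op, whether or not the ALU op uses the loaded value.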

Scheduling Heuristics (cont)
Idea: schedule an instruction earlier when...
– It does not interlock with the previously scheduled instruction (avoid stalls)
– It interlocks with its successors in the dependence graph (may enable successors to be scheduled without stall)
– It has many successors in the graph (may enable successors to be scheduled with greater flexibility)
– It is on the critical path (the goal is to minimize time, after all)

Scheduling Algorithm
    Build dependence graph G
    Candidates ← set of all roots (nodes with no in-edges) in G
    while Candidates ≠ ∅
        Select instruction s from Candidates    {Using heuristics, in order}
        Schedule s
        Candidates ← Candidates − s
        Candidates ← Candidates ∪ “exposed” nodes
            {Add to Candidates those nodes whose predecessors have all been scheduled}

Scheduling Example
Dependence Graph
[Figure: dependence graph over instructions 1 to 9; the edge layout is not recoverable]
Candidates
[Animated on the slide; the successive candidate sets are not recoverable]

Scheduling Example (cont)
Original code
    1  addi $r2,1,$r1
    2  addi $sp,12,$sp
    3  st   a, $r0
    4  ld   $r3,-4($sp)
    5  ld   $r4,-8($sp)
    6  addi $sp,8,$sp
    7  st   0($sp),$r2
    8  ld   $r5,a
    9  addi $r4,1,$r4
Hazards in original schedule
    (3,4), (5,6), (7,8), (8,9)
Scheduled code
    3  st   a, $r0
    2  addi $sp,12,$sp
    5  ld   $r4,-8($sp)
    4  ld   $r3,-4($sp)
    8  ld   $r5,a
    1  addi $r2,1,$r1
    6  addi $sp,8,$sp
    7  st   0($sp),$r2
    9  addi $r4,1,$r4
Hazards in new schedule
    (8,1)
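The scheduling algorithm can be sketched in Python as follows. This is an illustration under several assumptions, not the lecture's implementation. The edge set `EDGES` is hypothetical, since the slide's dependence graph did not survive extraction, and it treats the store to `0($sp)` and the load of `a` as independent. Only the first three heuristics are applied (no interlock with the previous instruction, interlocks with a successor, number of successors), with the original instruction number as a final tie-break; the critical-path heuristic is omitted. `addi` is assumed to be the only ALU opcode.

```python
OPS = {1: "addi", 2: "addi", 3: "st", 4: "ld", 5: "ld",
       6: "addi", 7: "st", 8: "ld", 9: "addi"}

# Hypothetical dependence edges (pred, succ); an assumption for this sketch.
EDGES = {(1, 7), (2, 4), (2, 5), (3, 4), (3, 5), (3, 8),
         (4, 6), (5, 6), (5, 9), (6, 7)}

ALU_OPS = {"addi"}


def interlocks(first_op, second_op):
    """Load then ALU op, or store then load, produces an interlock."""
    return (first_op == "ld" and second_op in ALU_OPS) or \
           (first_op == "st" and second_op == "ld")


def list_schedule(ops, edges):
    succs = {i: [] for i in ops}
    unscheduled_preds = {i: 0 for i in ops}
    for pred, succ in edges:
        succs[pred].append(succ)
        unscheduled_preds[succ] += 1

    # Candidates start as the roots of the dependence graph.
    candidates = {i for i in ops if unscheduled_preds[i] == 0}
    schedule = []
    while candidates:
        prev = schedule[-1] if schedule else None

        def priority(i):
            # Smaller tuples are preferred; False sorts before True.
            stalls_now = prev is not None and interlocks(ops[prev], ops[i])
            no_succ_interlock = not any(interlocks(ops[i], ops[t]) for t in succs[i])
            return (stalls_now, no_succ_interlock, -len(succs[i]), i)

        s = min(candidates, key=priority)
        schedule.append(s)
        candidates.remove(s)
        # Expose nodes whose predecessors have all been scheduled.
        for t in succs[s]:
            unscheduled_preds[t] -= 1
            if unscheduled_preds[t] == 0:
                candidates.add(t)
    return schedule


schedule = list_schedule(OPS, EDGES)
print(schedule)  # [3, 2, 5, 4, 8, 1, 6, 7, 9]
```

With these particular assumptions the result is a topological order of the assumed graph whose only adjacent interlock is the pair (8,1). A different edge set or heuristic order may produce a different, equally valid schedule.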
