SLIDE 1

Exploitation of instruction level parallelism

Computer Architecture

J. Daniel García Sánchez (coordinator)
David Expósito Singh
Francisco Javier García Blas

ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid


– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/55

SLIDE 2

Outline

1. Compilation techniques and ILP
2. Advanced branch prediction techniques
3. Introduction to dynamic scheduling
4. Speculation
5. Multiple issue techniques
6. ILP limits
7. Thread level parallelism
8. Conclusion

SLIDE 3

Exploitation of instruction level parallelism Compilation techniques and ILP

Taking advantage of ILP

ILP is directly exploitable within basic blocks.

Basic block: a sequence of instructions without branches. In a typical MIPS program, the average basic block size is 3 to 6 instructions, so little ILP can be exploited within a single block.

We need to exploit ILP across basic blocks.

Example

for (i = 0; i < 1000; i++) { x[i] = x[i] + y[i]; }

This loop exhibits loop-level parallelism, which can be transformed into ILP by the compiler or by the hardware.

Alternatives: vector instructions, or SIMD instructions in the processor.

SLIDE 4

Exploitation of instruction level parallelism Compilation techniques and ILP

Scheduling and loop unrolling

Parallelism exploitation:

Interleave execution of unrelated instructions. Fill stall slots with useful instructions. Do not alter the effects of the original program.

The compiler can do this with detailed knowledge of the architecture.

SLIDE 5

Exploitation of instruction level parallelism Compilation techniques and ILP

ILP exploitation

Example

for (i = 999; i >= 0; i--) { x[i] = x[i] + s; }

Each iteration body is independent.

Latencies between instructions:

Producing instruction | Using instruction | Latency (cycles)
FP ALU operation      | FP ALU operation  | 3
FP ALU operation      | Store double      | 2
Load double           | FP ALU operation  | 1
Load double           | Store double      | 0

SLIDE 6

Exploitation of instruction level parallelism Compilation techniques and ILP

Compiled code

R1 → last array element. F2 → scalar s. R2 → precomputed so that 8(R2) is the first element in the array.

Assembler code:

Loop:  L.D    F0, 0(R1)     ; F0 <- x[i]
       ADD.D  F4, F0, F2    ; F4 <- F0 + s
       S.D    F4, 0(R1)     ; x[i] <- F4
       DADDUI R1, R1, #-8   ; i--
       BNE    R1, R2, Loop  ; branch if R1 != R2

SLIDE 7

Exploitation of instruction level parallelism Compilation techniques and ILP

Stalls in execution

Original

Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       BNE    R1, R2, Loop

Stalls

Loop:  L.D    F0, 0(R1)
       <stall>
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       <stall>
       BNE    R1, R2, Loop

SLIDE 8

Exploitation of instruction level parallelism Compilation techniques and ILP

Loop scheduling

Original

Loop:  L.D    F0, 0(R1)
       <stall>
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       <stall>
       BNE    R1, R2, Loop

9 cycles per iteration.

Scheduled

Loop:  L.D    F0, 0(R1)
       DADDUI R1, R1, #-8
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 8(R1)
       BNE    R1, R2, Loop

7 cycles per iteration.
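The cycle counts above can be checked with a small issue-time simulator. This is a sketch under the latency table from the previous slides, plus the assumption that a branch depending on the immediately preceding integer operation costs one stall (matching the stall shown between DADDUI and BNE); the (name, kind, sources) encoding is invented for the example.

```python
# Sketch: count issue cycles for the loop bodies above.
LATENCY = {
    ("FP_ALU", "FP_ALU"): 3,
    ("FP_ALU", "STORE"): 2,
    ("LOAD", "FP_ALU"): 1,
    ("LOAD", "STORE"): 0,
    ("INT_ALU", "BRANCH"): 1,  # assumed: branch sees the integer result one cycle late
}

def cycles(instrs):
    """instrs: list of (name, kind, sources); returns issue cycle of the last one."""
    issued = {}
    cycle = 0
    for name, kind, sources in instrs:
        cycle += 1                      # at best, one instruction per cycle
        for src in sources:
            prod_cycle, prod_kind = issued[src]
            ready = prod_cycle + 1 + LATENCY.get((prod_kind, kind), 0)
            cycle = max(cycle, ready)   # wait out any stalls
        issued[name] = (cycle, kind)
    return cycle

original = [("L.D", "LOAD", []), ("ADD.D", "FP_ALU", ["L.D"]),
            ("S.D", "STORE", ["ADD.D"]), ("DADDUI", "INT_ALU", []),
            ("BNE", "BRANCH", ["DADDUI"])]
scheduled = [("L.D", "LOAD", []), ("DADDUI", "INT_ALU", []),
             ("ADD.D", "FP_ALU", ["L.D"]), ("S.D", "STORE", ["ADD.D"]),
             ("BNE", "BRANCH", ["DADDUI"])]
print(cycles(original), cycles(scheduled))  # 9 7
```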

SLIDE 9

Exploitation of instruction level parallelism Compilation techniques and ILP

Loop unrolling

Idea:

Replicate loop body several times. Adjust termination code. Use different registers for each iteration replica to reduce dependencies.

Effect:

Increase basic block length. Increase use of available ILP .

cbed

– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 9/55

slide-10
SLIDE 10

Exploitation of instruction level parallelism Compilation techniques and ILP

Unrolling

Unrolling (x4)

Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       L.D    F6, -8(R1)
       ADD.D  F8, F6, F2
       S.D    F8, -8(R1)
       L.D    F10, -16(R1)
       ADD.D  F12, F10, F2
       S.D    F12, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F16, F14, F2
       S.D    F16, -24(R1)
       DADDUI R1, R1, #-32
       BNE    R1, R2, Loop

Unrolling 4 iterations requires more registers. This example assumes that the array size is a multiple of 4.
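At the source level, the same transformation can be sketched as follows. The helper name is invented, and the cleanup loop is the "termination code adjustment" needed when the array size is not a multiple of 4 (the slide's example avoids it by assumption):

```python
# Sketch: unroll x[i] = x[i] + s by a factor of 4, with a cleanup loop.
def add_scalar_unrolled4(x, s):
    n = len(x)
    limit = n - n % 4
    i = 0
    while i < limit:            # unrolled part: four copies of the body
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:                # cleanup loop for the leftover iterations
        x[i] = x[i] + s
        i += 1
    return x

print(add_scalar_unrolled4([1, 2, 3, 4, 5, 6], 10))
```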

SLIDE 11

Exploitation of instruction level parallelism Compilation techniques and ILP

Stalls and unrolling

Unrolling (x4)

Loop:  L.D    F0, 0(R1)
       <stall>
       ADD.D  F4, F0, F2
       <stall>
       <stall>
       S.D    F4, 0(R1)
       L.D    F6, -8(R1)
       <stall>
       ADD.D  F8, F6, F2
       <stall>
       <stall>
       S.D    F8, -8(R1)
       L.D    F10, -16(R1)
       <stall>
       ADD.D  F12, F10, F2
       <stall>
       <stall>
       S.D    F12, -16(R1)
       L.D    F14, -24(R1)
       <stall>
       ADD.D  F16, F14, F2
       <stall>
       <stall>
       S.D    F16, -24(R1)
       DADDUI R1, R1, #-32
       <stall>
       BNE    R1, R2, Loop

27 cycles for every 4 iterations → 6.75 cycles per iteration.

SLIDE 12

Exploitation of instruction level parallelism Compilation techniques and ILP

Scheduling and unrolling

Unrolling (x4)

Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       S.D    F4, 0(R1)
       S.D    F8, -8(R1)
       S.D    F12, -16(R1)
       DADDUI R1, R1, #-32
       S.D    F16, 8(R1)
       BNE    R1, R2, Loop

Code reorganization: preserves dependencies and is semantically equivalent. Goal: make use of the stall slots.

The update of R1 is placed far enough from BNE. 14 cycles for every 4 iterations → 3.5 cycles per iteration.

SLIDE 13

Exploitation of instruction level parallelism Compilation techniques and ILP

Limits of loop unrolling

The improvement decreases with each additional unrolling.

Gains are limited to stall removal; the loop overhead is amortized among the unrolled iterations.

Increase in code size.

May increase the instruction cache miss rate.

Pressure on register file.

May cause a shortage of registers. The advantages are lost if not enough registers are available.

SLIDE 14

Exploitation of instruction level parallelism Advanced branch prediction techniques


SLIDE 15

Exploitation of instruction level parallelism Advanced branch prediction techniques

Branch prediction

Branches have a high impact on program performance. To reduce their impact:

Loop unrolling.

Branch prediction: at compile time; each branch handled in isolation.

Advanced branch prediction: correlated predictors and tournament predictors.

SLIDE 16

Exploitation of instruction level parallelism Advanced branch prediction techniques

Dynamic scheduling

Hardware reorders instruction execution to reduce stalls while preserving data flow and exception behavior. It can handle cases unknown at compile time, such as cache misses and hits.

Benefits: code is less dependent on a concrete pipeline, the compiler is simplified, and hardware speculation becomes possible.

SLIDE 17

Exploitation of instruction level parallelism Advanced branch prediction techniques

Correlated prediction

Example: if the first and second branches are taken, the third is NOT taken.

if (a == 2) { a = 0; }
if (b == 2) { b = 0; }
if (a != b) { f(); }

Maintains a history of the last branches to select among several predictors. An (m, n) predictor:

Uses the outcome of the last m branches to select among 2^m predictors. Each predictor has n bits.

A (1, 2) predictor uses the outcome of the last branch to select between 2 two-bit predictors.
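A (1, 2) predictor can be sketched in a few lines. The class and its table layout are illustrative, not a hardware description:

```python
# Sketch of a (1, 2) correlating predictor: the outcome of the last branch
# selects one of two 2-bit saturating counters per branch address.
class CorrelatingPredictor:
    def __init__(self):
        self.history = 0            # outcome of the last branch (0 or 1)
        self.counters = {}          # (address, history) -> 2-bit counter

    def predict(self, address):
        counter = self.counters.get((address, self.history), 0)
        return counter >= 2         # predict taken if counter is 2 or 3

    def update(self, address, taken):
        key = (address, self.history)
        counter = self.counters.get(key, 0)
        # saturating increment on taken, decrement on not-taken
        self.counters[key] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.history = 1 if taken else 0

p = CorrelatingPredictor()
# A branch that alternates taken / not-taken is perfectly correlated with
# its own previous outcome, so only the warm-up iterations miss.
misses = 0
for taken in [True, False] * 20:
    if p.predict(0x40) != taken:
        misses += 1
    p.update(0x40, taken)
print(misses)  # 2
```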

SLIDE 18

Exploitation of instruction level parallelism Advanced branch prediction techniques

Size of predictor

An (m, n) predictor has several entries for each branch address.

Total size: S = 2^m × n × entries per address.

Examples:

(0, 2) predictor with 4K entries → 8 Kbit
(2, 2) predictor with 4K entries → 32 Kbit
(2, 2) predictor with 1K entries → 8 Kbit
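The example sizes can be checked directly from the formula:

```python
# Sketch: predictor size in bits, S = 2**m * n * entries.
def predictor_bits(m, n, entries):
    return 2 ** m * n * entries

print(predictor_bits(0, 2, 4096) // 1024)  # 8  Kbit
print(predictor_bits(2, 2, 4096) // 1024)  # 32 Kbit
print(predictor_bits(2, 2, 1024) // 1024)  # 8  Kbit
```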

SLIDE 19

Exploitation of instruction level parallelism Advanced branch prediction techniques

Miss rate

A correlated predictor has fewer misses than a simple predictor of the same size, and even fewer misses than a simple predictor with an unlimited number of entries.

SLIDE 20

Exploitation of instruction level parallelism Advanced branch prediction techniques

Tournament prediction

Combines two predictors:

Global information based predictor. Local information based predictor.

Uses a selector to choose between the two predictors. The selection is driven by a 2-bit saturating counter.

Advantage: allows different behavior for integer and FP code.

SPEC: for integer benchmarks the global predictor is selected about 40% of the time; for FP benchmarks, about 15%.

Used in: Alpha 21264 and AMD Opteron.
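The selector mechanism can be sketched as follows. The component predictors here are toy stand-ins (stateless functions), not real global and local predictors:

```python
# Sketch of tournament selection: a 2-bit saturating counter per branch
# picks a component predictor and is trained toward whichever one was right.
class TournamentPredictor:
    def __init__(self, global_pred, local_pred):
        self.global_pred = global_pred
        self.local_pred = local_pred
        self.selector = {}                  # address -> 2-bit counter

    def predict(self, address):
        use_global = self.selector.get(address, 2) >= 2
        pred = self.global_pred if use_global else self.local_pred
        return pred(address)

    def update(self, address, taken):
        g_ok = self.global_pred(address) == taken
        l_ok = self.local_pred(address) == taken
        counter = self.selector.get(address, 2)
        if g_ok and not l_ok:
            counter = min(3, counter + 1)   # move toward the global predictor
        elif l_ok and not g_ok:
            counter = max(0, counter - 1)   # move toward the local predictor
        self.selector[address] = counter

# Toy components: the "global" one always says taken, the "local" one never.
t = TournamentPredictor(lambda a: True, lambda a: False)
for _ in range(4):                          # this branch is always taken...
    t.update(0x10, True)
print(t.predict(0x10))  # True: selector saturated toward the global predictor
```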

SLIDE 21

Exploitation of instruction level parallelism Advanced branch prediction techniques

Intel Core i7

Predictor with two levels:

Smaller first level predictor. Larger second level predictor as backup.

Each predictor combines 3 predictors:

Simple 2-bit predictor. Global history predictor. Loop-exit predictor (iteration counter).

Besides:

Indirect jump predictor. Return address predictor.

SLIDE 22

Exploitation of instruction level parallelism Introduction to dynamic scheduling


SLIDE 23

Exploitation of instruction level parallelism Introduction to dynamic scheduling

Dynamic scheduling

Idea: hardware reorders instruction execution to decrease stalls. Advantages:

Compiled code optimized for one pipeline runs efficiently on another pipeline. Dependencies unknown at compile time are managed correctly. Unpredictable delays (e.g. cache misses) are tolerated.

Drawback:

More complex hardware.

SLIDE 24

Exploitation of instruction level parallelism Introduction to dynamic scheduling

Dynamic scheduling

Effects:

Out-of-order (OoO) execution. Out-of-order instruction completion. May introduce WAR and WAW hazards.

Separation of the ID stage into two stages:

Issue: decodes the instruction and checks for structural hazards. Read operands: waits until there is no data hazard, then fetches the operands.

Instruction Fetch (IF):

Fetches into instruction register or instruction queue.

SLIDE 25

Exploitation of instruction level parallelism Introduction to dynamic scheduling

Dynamic scheduling techniques

Scoreboard:

Stalls instruction issue until enough resources are available and there is no data hazard. Examples: CDC 6600, ARM Cortex-A8.

Tomasulo Algorithm:

Removes WAR and WAW hazards through register renaming. Examples: IBM 360/91, Intel Core i7.
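Register renaming, the key idea behind Tomasulo's approach, can be sketched as a simple table walk. The (dest, src1, src2) encoding and register counts are invented for the example:

```python
# Sketch: each write to an architectural register gets a fresh physical
# register, so WAR and WAW hazards disappear and only RAW dependencies remain.
def rename(instructions, num_arch_regs=32):
    """instructions: list of (dest, src1, src2) architectural register ids."""
    mapping = {r: r for r in range(num_arch_regs)}   # arch -> physical
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]        # read current mappings
        mapping[dest] = next_phys                    # allocate a fresh register
        renamed.append((next_phys, s1, s2))
        next_phys += 1
    return renamed

# r1 = r2 + r3 ; r4 = r1 + r5 ; r1 = r6 + r7  (WAW on r1, WAR with the 2nd op)
code = [(1, 2, 3), (4, 1, 5), (1, 6, 7)]
print(rename(code))
# [(32, 2, 3), (33, 32, 5), (34, 6, 7)] -- the two writes to r1 now target
# different physical registers, so they can complete in any order.
```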

SLIDE 26

Exploitation of instruction level parallelism Speculation


SLIDE 27

Exploitation of instruction level parallelism Speculation

Branches and parallelism limits

As parallelism increases, control dependencies become a harder problem; branch prediction alone is not enough.

The next step is to speculate on the branch outcome and execute as if the speculation were right: instructions are fetched, issued, and executed. A mechanism is needed to handle wrong speculations.

SLIDE 28

Exploitation of instruction level parallelism Speculation

Components

Ideas:

Dynamic branch prediction: selects the instructions to execute. Speculation: executes instructions before control dependencies are resolved; their effects may eventually be undone. Dynamic scheduling.

To achieve this, two things must be separated:

Passing an instruction result to another instruction that uses it. Instruction commit.

IMPORTANT: processor state (register file / memory) is not updated until changes are confirmed.

SLIDE 29

Exploitation of instruction level parallelism Speculation

Solution

Reorder Buffer (ROB):

When an instruction finishes execution, its result is written to the ROB. When the instruction is confirmed (commits), the result is written to its real target. Dependent instructions read modified data from the ROB.

ROB entries:

Instruction type: branch, store, register operation. Target: Register Id or memory address. Value: Instruction result value. Ready: Indication of instruction completion.
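The in-order commit discipline of the ROB can be sketched as follows. This is a simplified model (no speculation squashing, register targets only):

```python
# Sketch of a reorder buffer: results are recorded out of order, but the
# architectural state is only updated in program order, head-first.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()           # entries in program order
        self.registers = {}              # committed architectural state

    def issue(self, target):
        entry = {"target": target, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):      # may happen out of order
        entry["value"] = value
        entry["ready"] = True

    def commit(self):                    # retire ready entries, in order only
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            self.registers[e["target"]] = e["value"]

rob = ReorderBuffer()
e1 = rob.issue("F0")
e2 = rob.issue("F4")
rob.finish(e2, 7.0)                      # the second instruction finishes first
rob.commit()
print(rob.registers)                     # {} -- F4 must wait for F0
rob.finish(e1, 3.0)
rob.commit()
print(rob.registers)                     # {'F0': 3.0, 'F4': 7.0}
```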

SLIDE 30

Exploitation of instruction level parallelism Multiple issue techniques


SLIDE 31

Exploitation of instruction level parallelism Multiple issue techniques

CPI < 1

CPI ≥ 1 → at most one instruction issued per cycle. Multiple issue processors (CPI < 1, i.e. IPC > 1):

Statically scheduled superscalar processors.

In-order execution. Variable number of instructions per cycle.

Dynamically scheduled superscalar processors.

Out-of-order execution. Variable number of instructions per cycle.

VLIW processors (Very Long Instruction Word).

Several instructions packed into a single long instruction. Static scheduling. ILP made explicit by the compiler.

SLIDE 32

Exploitation of instruction level parallelism Multiple issue techniques

Approaches to multiple issue

Several approaches to multiple issue.

Static superscalar. Dynamic superscalar. Speculative superscalar. VLIW/LIW. EPIC.

SLIDE 33

Exploitation of instruction level parallelism Multiple issue techniques

Static superscalar

Issue: Dynamic. Hazards detection: Hardware. Scheduling: Static. Discriminating feature:

In-order execution.

Examples:

MIPS. ARM Cortex-A7.

SLIDE 34

Exploitation of instruction level parallelism Multiple issue techniques

Dynamic Superscalar

Issue: Dynamic. Hazards detection: Hardware. Scheduling: Dynamic. Discriminating feature:

Out-of-Order execution with no speculation.

Examples: None.

SLIDE 35

Exploitation of instruction level parallelism Multiple issue techniques

Speculative superscalar

Issue: Dynamic. Hazards detection: Hardware. Scheduling: Speculative dynamic. Discriminating feature:

Out-of-Order execution with speculation.

Examples:

Intel Core i3, i5, i7. AMD Phenom. IBM Power 7

SLIDE 36

Exploitation of instruction level parallelism Multiple issue techniques

VLIW

Packs several operations into a single long instruction. Example of one instruction in a VLIW ISA:

One integer operation or a branch. Two independent floating point operations. Two independent memory references.

IMPORTANT: Code must exhibit enough parallelism.
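Packing into such an instruction can be sketched as a slot-filling problem. The slot mix matches the example above, but the format and helper are hypothetical, and a real compiler must also verify that the packed operations are independent:

```python
# Sketch: fill one VLIW instruction with 1 integer/branch slot, 2 FP slots
# and 2 memory slots; unfilled slots become NOPs (wasted issue bandwidth).
SLOTS = {"int": 1, "fp": 2, "mem": 2}

def pack_bundle(ops):
    """ops: list of (kind, text), assumed independent. Returns (bundle, leftover)."""
    free = dict(SLOTS)
    bundle, leftover = [], []
    for kind, text in ops:
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            bundle.append(text)
        else:
            leftover.append((kind, text))  # no slot left: goes in the next bundle
    bundle += ["NOP"] * sum(free.values()) # pad the empty slots
    return bundle, leftover

ops = [("mem", "L.D F0,0(R1)"), ("mem", "L.D F6,-8(R1)"),
       ("fp", "ADD.D F4,F8,F2"), ("int", "DADDUI R1,R1,#-16"),
       ("mem", "S.D F14,16(R1)")]
bundle, rest = pack_bundle(ops)
print(bundle)  # the third memory reference overflows the 2 memory slots
print(rest)
```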

SLIDE 37

Exploitation of instruction level parallelism Multiple issue techniques

VLIW / LIW

Issue: Static. Hazards detection: Mostly software. Scheduling: Static. Discriminating feature:

All hazards determined and specified by the compiler.

Examples:

DSPs (e.g. TI C6x).

SLIDE 38

Exploitation of instruction level parallelism Multiple issue techniques

Problems with VLIW

Drawbacks from original VLIW model:

Complexity of statically finding enough parallelism. Large generated code size. No hazard detection hardware. More binary compatibility problems than in regular superscalar designs.

EPIC tries to solve most of these problems.

SLIDE 39

Exploitation of instruction level parallelism Multiple issue techniques

EPIC

Issue: Mostly static. Hazards detection: Mostly software. Scheduling: Mostly static. Discriminating feature:

All hazards determined and specified by compiler.

Examples:

Itanium.

SLIDE 40

Exploitation of instruction level parallelism ILP limits


SLIDE 41

Exploitation of instruction level parallelism ILP limits

ILP limits

To study the maximum ILP, we model an ideal processor with:

Infinite register renaming: all WAR and WAW hazards can be avoided. Perfect branch prediction: all conditional branch predictions hit. Perfect jump prediction: all jumps (including returns) are correctly predicted. Perfect memory address alias analysis: a load can safely be moved before a store if the addresses are not identical. Perfect caches: all cache accesses take one clock cycle (always hit).

SLIDE 42

Exploitation of instruction level parallelism ILP limits

Available ILP

[Bar chart: available ILP in the ideal processor, in instructions per cycle — gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1]

SLIDE 43

Exploitation of instruction level parallelism ILP limits

However . . .

More ILP implies more control logic:

Smaller caches. Longer clock cycle. Higher energy consumption.

Practical limitation:

Issue from 3 to 6 instructions per cycle.

SLIDE 44

Exploitation of instruction level parallelism Thread level parallelism


SLIDE 45

Exploitation of instruction level parallelism Thread level parallelism

Why TLP?

Some applications exhibit more natural parallelism than can be exploited with ILP.

Servers, scientific applications, . . .

Two models emerge:

Thread level parallelism (TLP):

Thread: has its own instructions and data; it may be part of a program or an independent program. Each thread has an associated state (instructions, data, PC, registers, . . . ).

Data level parallelism (DLP):

Identical operation on different data items.

SLIDE 46

Exploitation of instruction level parallelism Thread level parallelism

TLP

ILP exploits implicit parallelism within a basic block or a loop. TLP uses multiple threads of execution that are inherently parallel. TLP goal:

Use multiple instruction flows to improve:

Throughput in computers using many programs. Execution time of multi-threaded programs.

SLIDE 47

Exploitation of instruction level parallelism Thread level parallelism

Multi-threaded execution

Multiple threads share the processor functional units, overlapping their use.

Per-thread state must be replicated n times: register file, PC, and page table (when threads do not belong to the same program). Memory is shared through the virtual memory mechanisms. Hardware provides fast thread context switching.

Kinds:

Fine grain: thread switch on every instruction. Coarse grain: thread switch on stalls (e.g. cache miss). Simultaneous: fine grain combined with multiple issue and dynamic scheduling.

SLIDE 48

Exploitation of instruction level parallelism Thread level parallelism

Fine-grain multithreading

Switches between threads on each instruction.

Thread execution is interleaved, usually in round-robin order, skipping threads that are stalled. The processor must be able to switch threads every clock cycle.

Advantage:

Can hide short and long stalls.

Drawback:

Delays individual thread execution due to sharing.

Example: Sun Niagara.
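Round-robin interleaving with stalled threads excluded can be sketched as follows (the function and its encoding are illustrative only):

```python
# Sketch of fine-grain multithreading: each cycle, issue from the next
# ready thread in round-robin order, skipping threads that are stalled.
def fine_grain_schedule(num_threads, stalled, cycles):
    """stalled: set of stalled thread ids. Returns per-cycle issue order."""
    order = []
    thread = 0
    for _ in range(cycles):
        for _ in range(num_threads):        # look for the next ready thread
            if thread not in stalled:
                order.append(thread)
                thread = (thread + 1) % num_threads
                break
            thread = (thread + 1) % num_threads
        else:
            order.append(None)              # every thread stalled: bubble
    return order

print(fine_grain_schedule(4, {2}, 6))  # [0, 1, 3, 0, 1, 3]
```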

SLIDE 49

Exploitation of instruction level parallelism Thread level parallelism

Coarse grain multithreading

Switch thread only on long stalls.

Example: L2 cache miss.

Advantages:

No need for a very fast thread switch. Does not delay individual threads.

Drawbacks:

Must flush or freeze the pipeline. Needs to fill pipeline with instructions from the new thread (latency).

Appropriate when filling the pipeline takes much less time than the stall itself.

Example: IBM AS/400.

SLIDE 50

Exploitation of instruction level parallelism Thread level parallelism

SMT: Simultaneous multithreading

Idea: dynamically scheduled processors already have many mechanisms to support multithreading:

A large set of virtual registers: can hold the register sets of multiple threads. Register renaming: avoids conflicts between threads accessing the registers. Out-of-order completion.

Modifications needed: a per-thread renaming table, separate PC registers, and a separate ROB per thread.

Examples: Intel Core i7, IBM Power 7.

SLIDE 51

Exploitation of instruction level parallelism Thread level parallelism

TLP: Summary

[Figure: issue-slot usage for five threads under superscalar, fine-grain, coarse-grain, SMT, and multiprocessor execution; empty slots mark stalls]

SLIDE 52

Exploitation of instruction level parallelism Conclusion


SLIDE 53

Exploitation of instruction level parallelism Conclusion

Summary

Loop unrolling hides stall latencies, but offers limited improvement. Dynamic scheduling manages stalls unknown at compile time. Speculative techniques build on branch prediction and dynamic scheduling. Multiple issue is limited in practice to 3 to 6 instructions per cycle. SMT is an approach to TLP within one core.

SLIDE 54

Exploitation of instruction level parallelism Conclusion

References

Computer Architecture. A Quantitative Approach 5th Ed. Hennessy and Patterson. Sections 3.1, 3.2, 3.3, 3.4, 3.6, 3.7, 3.10, 3.12. Recommended exercises:

3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.11, 3.14, 3.17.

SLIDE 55

Exploitation of instruction level parallelism Conclusion

Exploitation of instruction level parallelism
