Instruction-Level Parallelism and Its Exploitation 1 MO401 Tpicos - PowerPoint PPT Presentation

MO401 IC-UNICAMP IC/Unicamp Prof Mario Côrtes Capítulo 3 – parte B (3.8 - 3.15): Instruction-Level Parallelism and Its Exploitation 1 MO401

Tópicos - estrutura IC-UNICAMP • Parte A – Basic compiler ILP – Advanced branch prediction – Dynamic scheduling – Hardware based speculation – Multiple issue and static scheduling • Parte B – Instruction delivery and speculation – Limitations of ILP – ILP and memory issues – Multithreading 2 MO401

Dynamic Scheduling, Multiple Issue, and Speculation 3.8 Dynamic Scheduling, Multiple Issue, and Speculation IC-UNICAMP • Até agora, vistos separadamente – Dynamic scheduling, multiple issue, speculation • Modern microarchitectures: – Dynamic scheduling + multiple issue + speculation • Hipótese simplificadora: 2 issues / ciclo • Extensão do alg. Tomasulo: multiple issue supersacalar pipeline, separate integer, LD/ST, FP units (add, mult) – FUs: initiate operation every clock • Issue to RS in-order. Any two operations (every cycle) 3 MO401

Dynamic Scheduling, Multiple Issue, and Speculation IC-UNICAMP Overview of Design New: issue and completion logic must support 2 instructions / clock cycle 4 MO401

Extended Tomasulo IC-UNICAMP • Multiple issue / cycle: muito complicado. – ex: as duas operações podem ter dependência e tabelas tem que ser atualizadas em paralelo (no mesmo clk) • Two approaches: – Assign reservation stations and update pipeline control table in half clock cycles • Only supports 2 instructions/clock – Design logic to handle any possible dependencies between the instructions – Hybrid approaches • Modern superscalar processors (4+ issues) use both: – Issue logic: wide and pipelined • Issue logic can become bottleneck – (ver Fig 3.18, para apenas um caso) 5 MO401

IC-UNICAMP Complexidade: apenas uma dependência ins1 = LD ins2 = op FP com operando fornecido pelo LD 6 MO401

Multiple Issue IC-UNICAMP • 1- Pre-assign a RS and ROB entry. Limit the number of instructions of a given class that can be issued in a “bundle” – I.e. on FP, one integer, one load, one store • 2- Examine all the dependencies among the instructions in the bundle • 3- If dependencies exist in bundle, encode them in reservation stations and ROB • All above: a single clock cycle • At pipeline backend: need multiple completion/commit – Easier, because dependences have already been dealt with • Intel i7 usa este esquema 7 MO401

Exmpl p 200: multiple issue with and without speculation IC-UNICAMP 8 MO401

No speculation IC-UNICAMP 9 MO401

With speculation IC-UNICAMP 10 MO401

3.9 Advanced Techniques IC-UNICAMP • Objetivo: possibilitar alta taxa de execução de instruções por ciclo – Increasing instruction delivery bandwidth – Advanced speculation techniques – Value prediction 11 MO401

Animações e simulações IC-UNICAMP • Ver site – http://www.williamstallings.com/COA/Animation/Links.html • Contém várias simulações: – Branch prediction – Branch Target Buffer – Loop unrolling – Pipeline with static vs. dynamic scheduling – Reorder Buffer Simulator – Scoreboarding technique for dynamic scheduling: – Tomasulo's Algorithm: 12 MO401

Increasing instruction fetch bandwidth IC-UNICAMP • Need high instruction bandwidth (from Instr. Cache to Issue Unit) – problema: como saber antes da decodificação se instrução é desvio e qual é o próximo PC? • Branch-Target buffers – Next PC prediction buffer, indexed by current PC • Diferenças com o branch prediction buffer já visto – branch prediction buffer: • após decodificação; só branches são tratados; index pode apontar para outro branch – no Branch-Target buffer • antes da decodificação; todas as instruções são tratadas ; “tag” do buffer identifica univocamente somente branches; somente “taken branches” são armazenados  demais instruções seguem o fetch normalmente 13 MO401

Adv. Techniques for Instruction Delivery and Speculation IC-UNICAMP Branch- Target Buffer 14 MO401

Adv. Techniques for Instruction Delivery and Speculation Branch-Target Buffer: steps IC-UNICAMP 15 MO401

Exmpl p205: penalidade IC-UNICAMP 16 MO401

Exmpl p205: penalidade IC-UNICAMP 17 MO401

Adv. Techniques for Instruction Delivery and Speculation Branch Folding IC-UNICAMP • Optimization: – Larger branch-target buffer – Add target instruction into buffer to deal with longer decoding time required by larger buffer – Allows “Branch folding” • Branch folding – With unconditional branch: o hardware permite “pular” o jump (cuja única função é mudar o PC) – In some cases, also with conditional branch 18 MO401

Adv. Techniques for Instruction Delivery and Speculation Return Address Predictor IC-UNICAMP • Most unconditional branches come from function returns – Indirect jump: JR (target muda em tempo de execução) – SPEC95: retorno de procedimento = 15% de todos os branches e aproximadamente 100% dos desvios incondicionais • The same procedure can be called from multiple sites – Causes the buffer to potentially forget about the return address from previous calls (changes at runtime) – SPEC CPU95: retorno de procedimento  misprediction = 40% • Create return address buffer organized as a stack – melhora consideravelmente o desempenho (fig 3.24) • (usado pelo Intel Core e AMD Phenom) 19 MO401

IC-UNICAMP Desempenho do Return Address Predictor Figure 3.24 Prediction accuracy for a return address buffer operated as a stack on a number of SPEC CPU95 benchmarks. The accuracy is the fraction of return addresses predicted correctly. A buffer of 0 entries implies that the standard branch prediction is used. Since call depths are typically not large, with some exceptions, a modest buffer works well. These data come from Skadron et al. [1999] and use a fix-up mechanism to prevent corruption of the cached return addresses. 20 MO401

Adv. Techniques for Instruction Delivery and Speculation Integrated Instruction Fetch Unit IC-UNICAMP • Design monolithic unit that performs: – Integrated branch prediction: • parte da instruction fetch – Instruction prefetch • Fetch ahead – Instruction memory access and buffering • Accessing multiple cache lines • Deal (hide) with crossing cache lines • (used by all high-end processors) 21 MO401

Register Renaming IC-UNICAMP • Register renaming vs. reorder buffers – Instead of virtual registers from reservation stations and reorder buffer, create a single register pool • Contains visible registers and virtual registers – Use hardware-based map to rename registers during issue – WAW and WAR hazards are avoided – Speculation recovery occurs by copying during commit – Still need a ROB-like queue to update table in order – Simplifies commit: • Record that mapping between architectural register and physical register is no longer speculative • Free up physical register used to hold older value • In other words: SWAP physical registers on commit – Physical register de-allocation is more difficult 22 MO401

Integrated Issue and Renaming IC-UNICAMP • Combining instruction issue with register renaming: – Issue logic pre-reserves enough physical registers for the bundle (ex: 4 registers for a 4 instruction bundle, 1 reg / result) – Issue logic finds dependencies within bundle, maps registers as necessary – Issue logic finds dependencies between current bundle and already in-flight bundles, maps registers as necessary • Como no ROB, o hardware deve determinar as dependências e atualizar as tabelas de renaming em um único clock – quanto maior o número de instruções emitidas por clock, mais complicado 23 MO401

How Much? IC-UNICAMP • How much to speculate – Mis-speculation degrades performance and power relative to no speculation • May cause additional misses (cache, TLB) – Prevent speculative code from causing higher costing misses (e.g. L2) • Speculating through multiple branches – Poderia ser útil em • very high branch frequency • branch clustering • long delay in FUs – Complicates speculation recovery (mas o resto seria simples) – Até 2011, esquema não utilizado comercialmente • No processor can resolve multiple branches per cycle 24 MO401

Adv. Techniques for Instruction Delivery and Speculation Energy Efficiency IC-UNICAMP • Custo energético da especulação errada – Trabalho inútil que deve ser descartado – Custo adicional da recuperação • Speculation and energy efficiency – Note: speculation is only energy efficient when it significantly improves performance • Se um número grande de instruções desnecessárias estão sendo executadas, é provável que, além do custo de energia, também o desempenho está piorando – fig 3.25  resultado ruim para inteiros  provável que cause baixa eficiência energética 25 MO401

Fração de instruções desnecessárias IC-UNICAMP Figure 3.25 The fraction of instructions that are executed as a result of misspeculation is typically much higher for integer Programs (the first five) versus FP programs (the last five). 26 MO401

Instruction-Level Parallelism and Its Exploitation 1 MO401 Tpicos - PowerPoint PPT Presentation

MO401 IC-UNICAMP IC/Unicamp Prof Mario Crtes Captulo 3 parte B (3.8 - 3.15): Instruction-Level Parallelism and Its Exploitation 1 MO401 Tpicos - estrutura IC-UNICAMP Parte A Basic compiler ILP Advanced branch prediction

Exploitation of instruction level parallelism Computer Architecture J. Daniel Garca Snchez

Chapter 2 Chapter 2 Instruction-Level Parallelism and Its Exploitation p 1 Overview

Chapter 2 Instruction-Level Parallelism and Its E Exploitation l it ti 1 Overview

Hardware Parallelism vs. Software Parallelism USENIX Workshop on Hot Topics in Parallelism March

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal

Instruction-Level Parallelism (ILP) Fine-grained parallelism Obtained by: instruction

CSCI341 Lecture 37, Introduction to Parallelism PIPELINING Exploits potential parallelism

MLP yes! Definitions ILP no ! MLP ILP = Instruction Level = Memory Level Parallelism Work

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 3) ILP vs. Parallel

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 3) ILP vs. Parallel

Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism

Data-Level Parallelism Nima Honarmand Fall 2015 :: CSE 610 Parallel Computer Architectures

Pervasive Parallelism Laboratory Stanford University ppl.stanford.edu Make parallelism

Advanced OpenMP Lecture 6: Nested parallelism Nested parallelism Nested parallelism is

Dataflow Computers Motivation: exploit instruction-level parallelism on a massive scale

! Current State of Exploitation ! Return-Oriented Exploitation ! Mac OS X x86 Return-Oriented

DESIGNING SCIENCE PRESENTATIONS: A VISUAL GUIDE TO FIGURES, PAPERS, SLIDES, POSTERS, AND MORE

THE 21 ST CENTURY LEARNER Dr. Audrey J. Penner APLA Conference Charlottetown, May 2013 Locally

5/9/2018 1 Therapeutic index in radiotherapy

Grown up from Vim user to Vim plugin developer side IK : 2019-11-3 : VimConf2019 Designed by

Nordhaus-Gaddum inequalities for coloring games Cl ement Charpentier Joint work with Simone

Odds and ends on equivariant cohomology and traces Weizhe Zheng Columbia University

CS481: Bioinformatics Algorithms Can Alkan EA224 calkan@cs.bilkent.edu.tr

Java Programming Unit 13 Working with Swing JTable