Low-Complexity Reorder Buffer Architecture*
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th, 2002
*Supported in part by DARPA through the PAC-C program and NSF

Outline
ROB complexities
Motivation for the low-complexity ROB
Low-complexity ROB design
Results
Concluding remarks

Pentium III-like Superscalar Datapath
[Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), Instruction Issue (IQ), Function Units (FU1, FU2, ..., FUm), LSQ, D-cache, ROB, Architectural Register File (ARF), and the result/status forwarding buses]

ROB Port Requirements for a W-way CPU
Decode/Dispatch: W write ports to set up entries
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands
Commit: W read ports for instruction commitment
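The port tally on this slide can be written out as a small calculation. This is only an illustrative sketch: the function name `rob_port_counts` and the dictionary keys are made up here, and the only inputs taken from the slide are the per-stage counts (W, W, 2W, W).

```python
def rob_port_counts(width: int) -> dict:
    """Tally ROB ports for a W-way CPU, using the per-stage counts from the slide."""
    ports = {
        "dispatch_write": width,     # W write ports to set up entries
        "writeback_write": width,    # W write ports to write results
        "source_read": 2 * width,    # 2W read ports to read the source operands
        "commit_read": width,        # W read ports for instruction commitment
    }
    # The total is computed before the "total" key is added to the dictionary.
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    counts = rob_port_counts(4)
    print(counts)
    # Ports that would remain if the source-operand read ports were removed.
    print(counts["total"] - counts["source_read"])
```

Under this tally the source-operand read ports account for 2W of the 5W ports, which is why they are the natural target for simplification.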

Where are the Source Values Coming From?
[Figure: the same datapath (Fetch, Decode/Dispatch, IQ, FU1 ... FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses) with three source-value paths marked 1, 2, and 3]

Where are the Source Values Coming From?
Average share of source values by origin: Forwarding 62%, ARF 32%, ROB 6%
[Figure: per-benchmark breakdown of source-value origin (0% to 100%) for the SPEC2K benchmarks, with integer and FP averages]
96-entry ROB, 4-way processor, SPEC2K benchmarks

How Efficiently are the Ports Used?
Decode/Dispatch: W write ports to set up entries
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands (marked 6% on the slide)
Commit: W read ports for instruction commitment

Approaches to Reducing ROB Complexity
Reduce the number of read ports for reading out the source operand values
More radical (and better): completely eliminate the read ports for reading source operand values!

Reducing the Number of Read Ports
Average IPC drop: 3.5% with 1 read port, 1.0% with 2 read ports
[Figure: performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

Problems with Retaining Fewer Source Read Ports on the ROB
Need arbitration for the small number of ports
Additional logic is needed to block the instructions that could not get a port
Need a switching network to route the operands to the correct destinations
Multi-cycle access still remains in the critical path of the Dispatch/Issue logic

Our Solution: Elimination of Read Ports
[Figure: three animation frames of the datapath (Fetch, Decode/Dispatch, IQ, FU1 ... FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses). The first two frames show source-value paths 1, 2, and 3; in the last frame path 2 is no longer present]

Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
Area reduction: 71%
Shorter bitlines and wordlines

Our Solution: Elimination of Read Ports
[Figure: the datapath without the source-value paths into Dispatch/Issue marked]
Area reduction: 45%

Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
Power is reduced because of:
shorter bitlines and wordlines
lower capacitive loading
fewer decoders
fewer drivers and sense amps

Completely Eliminating the Source Read Ports on the ROB
The problem: issue of instructions that require a value stored in the ROB will stall
Solution: forward the value to the waiting instruction at the time of committing the value (LATE FORWARDING)

Late Forwarding: Use the Normal Forwarding Buses!
[Figure: two animation frames of the datapath (Fetch, Decode/Dispatch, IQ, FU1 ... FUm, LSQ, D-cache, ROB, ARF) highlighting the result/status forwarding buses]

Optimizing Late Forwarding
PROBLEM: If Late Forwarding is done for every result that is committed, additional forwarding buses are needed to avoid degrading performance
SOLUTION: Selective Late Forwarding (SLF)
SLF requires an additional bit in each ROB entry
That bit is set by the dispatched instructions that require Late Forwarding
No additional forwarding buses are needed, since SLF traffic is very small
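The SLF bit described on this slide can be illustrated with a toy model. This is a minimal sketch under assumptions, not the authors' hardware: the class and method names (`SlfRob`, `mark_late_forward`, and so on) are hypothetical, slot ids simply count upward instead of wrapping around, and a "forward" is modelled as a returned (slot, value) pair.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RobEntry:
    value: Optional[int] = None        # result, filled in at writeback
    needs_late_forward: bool = False   # the extra SLF bit


class SlfRob:
    """Toy ROB without source read ports; values leave only at commit."""

    def __init__(self) -> None:
        self.entries = OrderedDict()   # slot id -> RobEntry, oldest first
        self.next_slot = 0             # assumption: ids grow, no wrap-around

    def dispatch(self) -> int:
        slot = self.next_slot
        self.next_slot += 1
        self.entries[slot] = RobEntry()
        return slot

    def writeback(self, slot: int, value: int) -> None:
        self.entries[slot].value = value

    def mark_late_forward(self, slot: int) -> None:
        # Called on behalf of a dispatched consumer whose source value is
        # already sitting in the ROB and therefore cannot be read directly.
        self.entries[slot].needs_late_forward = True

    def commit(self) -> Optional[Tuple[int, int]]:
        """Retire the oldest entry; return (slot, value) only if SLF is needed."""
        slot = next(iter(self.entries))
        entry = self.entries[slot]
        if entry.value is None:
            raise RuntimeError("oldest instruction has not written back yet")
        del self.entries[slot]
        return (slot, entry.value) if entry.needs_late_forward else None


if __name__ == "__main__":
    rob = SlfRob()
    a, b = rob.dispatch(), rob.dispatch()
    rob.writeback(a, 10)
    rob.writeback(b, 20)
    rob.mark_late_forward(b)   # some consumer is waiting for b's value
    print(rob.commit())        # a: no waiting consumer, nothing is forwarded
    print(rob.commit())        # b: value goes out on a forwarding bus
```

Only entries whose bit is set generate forwarding-bus traffic at commit, which is the property the slide relies on to avoid adding buses.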

Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath with the result/status forwarding buses highlighted]
Only 3.5% of the traffic on the result/status forwarding buses is from SELECTIVE LATE FORWARDING

Performance Drop of Simplified ROB
Average IPC drop: 9.6% with no ROB read ports and SLF, 3.5% with 1 read port, 1.0% with 2 read ports
[Figure: performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., with one bar labeled 17%. FP: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg., with one bar labeled 37%]

IPC Penalty: Source Value Not Accessible within the ROB
[Figure: timeline of the lifetime of a result value: result generation and forwarding, then value within ROB, then late forwarding/commitment, then value within ARF]

Improving IPC with No Read Ports
Cache recently generated values in a set of RETENTION LATCHES (RL)
Retention Latches are SMALL and FAST
Only 8 to 16 latches are needed in the set
The entire set has 1 or 2 read ports

Datapath with the Retention Latches
[Figure: two animation frames of the datapath (Fetch, Decode/Dispatch, IQ, FU1 ... FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses); the second frame adds the RETENTION LATCHES]

The Structure of the Retention Latch Set
8 or 16 latches, each holding an L-ported CAM field (key = ROB_slot_id), a result value, and status
W write ports for writing up to W results in parallel
Read side: L ROB slot addresses in (L = 1 or 2), L recently-written results out (L = 1 or 2 works great)

Retention Latch Management Strategies
FIFO: 8-entry RL gives a 42% hit rate; 16-entry RL gives a 55% hit rate
LRU: 8-entry RL gives a 56% hit rate; 16-entry RL gives a 62% hit rate
Random replacement: worse performance than FIFO
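The difference between the FIFO and LRU strategies can be sketched in software. This is a behavioural illustration only, assuming a hypothetical `RetentionLatches` class keyed by ROB slot id; it ignores port limits, status bits, and timing, and it is not meant to reproduce the hit rates quoted above.

```python
from collections import OrderedDict
from typing import Optional


class RetentionLatches:
    """Small cache of recent results, keyed by ROB slot id."""

    def __init__(self, size: int = 8, policy: str = "fifo") -> None:
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.size = size
        self.policy = policy
        self.latches = OrderedDict()   # slot id -> value, next victim first

    def write(self, rob_slot: int, value: int) -> None:
        if rob_slot in self.latches:
            del self.latches[rob_slot]          # same key: refresh the entry
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)    # evict the entry at the front
        self.latches[rob_slot] = value

    def read(self, rob_slot: int) -> Optional[int]:
        """Return the cached value, or None on a miss."""
        if rob_slot not in self.latches:
            return None
        if self.policy == "lru":
            # A hit makes the entry most recently used; FIFO order never changes.
            self.latches.move_to_end(rob_slot)
        return self.latches[rob_slot]


if __name__ == "__main__":
    for policy in ("fifo", "lru"):
        rl = RetentionLatches(size=2, policy=policy)
        rl.write(0, 100)
        rl.write(1, 101)
        rl.read(0)          # touch slot 0
        rl.write(2, 102)    # forces one eviction
        print(policy, rl.read(0), rl.read(1), rl.read(2))
```

With two latches, FIFO evicts slot 0 even though it was just read, while LRU keeps it and evicts slot 1 instead.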

Hit Ratios to Retention Latches
Average hit ratio: FIFO 8 2: 42%, FIFO 16 2: 55%, LRU 8 2: 56%, LRU 16 2: 62%
[Figure: hit ratios (0 to 100) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

Accessing Retention Latch Entries
The ROB index is used as a unique key in the Retention Latches to search for result values
Need to maintain unique keys even when we have:
Reuse of a ROB slot: not a problem for FIFO; for LRU, simply flush the RL entry at commit time
Branch mispredictions

Handling Branch Mispredictions
Selective RL Flushing: retention latch entries that are on the mispredicted path are flushed
Uses branch tags
Complicated implementation
Complete RL Flushing: all retention latch entries are flushed
Very simple implementation
Performance drop is only 1.5% compared to selective flushing
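The two flushing options can be contrasted in a few lines. This is a sketch under assumptions: the slide says only that selective flushing "uses branch tags", so the `LatchEntry` layout, the integer `branch_tag` field, and the `squashed_tags` set are hypothetical stand-ins for whatever tagging the hardware uses.

```python
from typing import Dict, NamedTuple, Set


class LatchEntry(NamedTuple):
    value: int
    branch_tag: int   # hypothetical: tag of the branch this result depends on


def flush_selective(latches: Dict[int, LatchEntry],
                    squashed_tags: Set[int]) -> Dict[int, LatchEntry]:
    """Keep only entries that are not on the mispredicted path."""
    return {slot: entry for slot, entry in latches.items()
            if entry.branch_tag not in squashed_tags}


def flush_complete(latches: Dict[int, LatchEntry]) -> Dict[int, LatchEntry]:
    """Drop every entry, with no per-entry tag comparison."""
    return {}


if __name__ == "__main__":
    latches = {
        3: LatchEntry(value=10, branch_tag=0),
        4: LatchEntry(value=20, branch_tag=1),
        5: LatchEntry(value=30, branch_tag=2),
    }
    # Suppose the branch with tag 1 was mispredicted, squashing tags 1 and 2.
    print(flush_selective(latches, {1, 2}))
    print(flush_complete(latches))
```

Complete flushing discards the still-valid entry in slot 3 as well, trading some later hits for a much simpler mechanism.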

Misprediction Handling: Performance
Average IPC drop of complete flushing relative to selective flushing: 1.5%
[Figure: IPC (0 to 3.5) per benchmark for selective flushing and complete flushing, covering the integer and FP benchmarks with Int. and FP averages]

Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed the microarchitectural simulator (rooted in SimpleScalar), which produces performance stats plus transition counts and context information. VLSI layout data yields a SPICE deck; SPICE produces measures of energy per transition. The energy/power estimator combines the two and outputs power/energy stats]
