Banked Multiported Register Files for High-Frequency Superscalar Microprocessors
Jessica H. Tseng and Krste Asanović
MIT Laboratory for Computer Science, Cambridge, MA 02139, USA
ISCA 2003

Motivation
• Increasing demand on the number of ports and the number of registers in a register file.
• Growing concerns about access time, power, and die area.
  – Example: the Alpha 21464 register file (RF) occupied over 5X the area of the 64KB primary data cache (DC).
[Figure: Alpha 21464 floorplan (ISSCC, 2002), labeling the RF (512x64b, 16R/8W) and the DC (64KB).]

Distributed Architecture
• Duplicated
  – Fewer Read Ports
  – Same Number of Write Ports
  – Twice Total Number of Registers
  – Alpha 21264 & Alpha 21464
• Non-Duplicated
  – Fewer Read Ports
  – Fewer Write Ports
  – Complex Inter-Cluster Communication
[Figure: for each organization, Cluster 0 and Cluster 1 each pair an RF with an ALU; the non-duplicated diagram adds an inter-cluster communication path.]

Centralized Architecture
• Multi-Level: Register File Cache
  – Fewer Read Ports
  – Fewer Write Ports
  – Control Logic Complexity
  – Poor Locality
• One-Level Multi-Banked
  – Fewer Read Ports
  – Fewer Write Ports
  – Possible Conflicts
  – Control Logic Complexity
  – Possible Pipeline Stalls
[Figure: RF and ALU block diagrams for the multi-level and one-level multi-banked organizations.]

Previous Work
• Use a minimal number of ports per register file bank: 1 or 2 read port(s) and 1 write port.
• Avoid issuing instructions that would cause register file read conflicts.
  – Adds complexity to the critical wakeup-select loop for the issue logic → slower cycle time
• Resolve register file write conflicts by either delaying physical register allocation until the write-back stage or installing write buffers.
  – Complex pipeline control logic
  – Possible pipeline stalls

Our Work
• Use more ports per register file bank: 2 read ports and 2 write ports.
• Speculatively issue potentially conflicting instructions.
  – Minimizes impact on the critical wakeup-select loop for the issue logic
• Rapidly repair the pipeline and reissue conflicting instructions when conflicts are detected after issue.
  – No write buffer requirement
  – No pipeline stalls
Simpler and Faster Control Logic

Example
• Four-issue superscalar machine with a 64x32b 8-banked register file.
  – Area Saving: 63%
  – Access Time Reduction: 25%
  – Energy Reduction: 40%
  – IPC Degradation: < 5%

Outline
1. Banked Register File Structure
2. Basic Pipeline Structure and Control Logic
3. Improving IPC
  • Bypass Skip
  • Read Sharing
4. Conclusion

Banked Register File Structure
[Figure: two 64x32b register files, 8B1R1W and 8B2R2W, each built from eight banks (Bank 0 to Bank 7) that feed 8 read and 4 write ports connected to the ALUs.]

Register File Floorplan
64x32b, 8 Read Ports & 4 Write Ports
[Figure: floorplans with legend entries storage array, address decoder, bank overhead, column cell.]
Area: Baseline 100%, 8B8R4W 123%, 8B2R2W 37%, 8B1R1W 30%

Baseline Pipeline Structure
Stages: 0 Fetch, 1 Decode, 2 Rename, 3 Issue, 4 Read/Bypass, 5 Execute, 6 Writeback
• Issue
  – WAKEUP PHASE: Broadcasts the result tags of issued instructions to update operand readiness.
  – SELECT PHASE: Picks a subset of ready instructions to issue.
Instruction window (each source operand carries a src ready bit):
issued  ready  opcode  src1 ready  src1  src2 ready  src2  dst
0       0      add     0           r2    1           r3    r9
1       1      sub     1           r3    1           5     r7
0       0      xor     0           r13   0           r17   r8
1       1      beq     1           r3    1           r1    0
0       1      add     1           r24   1           r3    r17
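The wakeup and select phases can be illustrated with a minimal Python sketch. The names (`Entry`, `wakeup`, `select`) and the oldest-first selection policy are assumptions made for illustration; the slides only state that select picks a subset of ready instructions.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Entry:
    """One instruction-window entry (hypothetical layout)."""
    opcode: str
    srcs: List[Optional[str]]   # source register tags; None for an immediate
    src_ready: List[bool]       # one src ready bit per source operand
    dst: Optional[str]
    issued: bool = False

    @property
    def ready(self) -> bool:
        # The entry's ready bit is set once every source is ready.
        return all(self.src_ready)


def wakeup(window: List[Entry], result_tags: Set[str]) -> None:
    """WAKEUP: broadcast result tags and set matching src ready bits."""
    for entry in window:
        for i, tag in enumerate(entry.srcs):
            if tag in result_tags:
                entry.src_ready[i] = True


def select(window: List[Entry], issue_width: int) -> List[Entry]:
    """SELECT: pick up to issue_width ready, unissued entries.

    Assumption: entries are chosen in window order (oldest first).
    """
    picked = [e for e in window if e.ready and not e.issued][:issue_width]
    for entry in picked:
        entry.issued = True
    return picked
```

With a window like the one on the slide, only the final `add` (ready, not yet issued) would be selected first; broadcasting the tag `r2` afterwards would wake up the first `add`.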

Modified Pipeline Structure
Stages: 0 Fetch, 1 Decode, 2 Rename, 3 Issue, 4 Arbitrate, 5 Read/Bypass, 6 Execute, 7 Writeback
• Speculatively issue potentially conflicting instructions: same wakeup-select loop
• Additional arbitration pipeline stage
  – Detects read and write bank conflicts when too many instructions try to read from or write to the same register file bank.
  – Muxes operand addresses into available register file ports.
  – Adds a cycle to branch misprediction latency.

N-way Arbitration
• An N-way superscalar needs only N-way arbitration for each bank port.
• Example: 4-way
[Figure: 64x32b 8B2R2W register file (Bank 0 to Bank 7, 8 read and 4 write ports); the left operands (src1) of inst1 to inst4, labeled 0, 7, 2, and 1, enter a 4-way arbitration.]
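A rough Python sketch of read-port arbitration for one operand position follows. It assumes that each request is already expressed as a bank index and that ports are granted in issue order; neither the register-to-bank mapping nor the priority rule is given in the slides, and the function name `arbitrate` is hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def arbitrate(bank_requests: List[Optional[int]],
              ports_per_bank: int = 2) -> Tuple[List[int], List[int]]:
    """Grant each bank's local read ports to requesters in issue order.

    bank_requests[i] is the bank instruction i wants to read, or None if it
    needs no register file read. Returns (granted, conflicts) as lists of
    instruction indices; conflicting instructions would have to be reissued.
    """
    used = defaultdict(int)          # bank -> number of ports already granted
    granted, conflicts = [], []
    for inst, bank in enumerate(bank_requests):
        if bank is None:
            granted.append(inst)     # nothing to arbitrate for this operand
        elif used[bank] < ports_per_bank:
            used[bank] += 1
            granted.append(inst)
        else:
            conflicts.append(inst)   # bank is oversubscribed this cycle
    return granted, conflicts
```

For four requests to banks 0, 7, 2, and 1 every instruction is granted a port, while three requests to the same two-ported bank leave the third as a conflict.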

Pipeline Repair Operation
[Figure: three successive issue groups, each passing through Issue (Wakeup, Select), Arbitrate, and Read/Bypass. Annotations on the diagram: Conflict Detected; Kill Conflicting Instructions; Kill Following Issue Group; Wakeup Dependents; Clear Ready Bits.]

Evaluating IPC Impact
• IPC degradation simulation: modify the SimpleScalar simulator to keep track of a unified physical register file organized into banks.
  – The shorter access time of banked register files may lead to a higher processor clock rate.
• Benchmarks: use a subset of SPEC2000 and Mediabench benchmarks that covers a range of different IPCs.

IPC Comparison (1)
[Figure: bar chart of IPC (axis 0.0 to 2.5) for Baseline and 8B2R2W on bzip2, gcc, gzip, twolf, vortex, ijpeg, adpcm, and the average.]
• IPC degradation ranges from 0.1 (9%) to 0.5 (31%), with an average of 0.3 (17%).

Improving IPC
• Avoid contending for register file read ports whenever possible.
  – Bypass Skip: Operands that will be sourced from the bypass network do not compete for access to the register file.
  – Read Sharing: Allow multiple instructions to read the same physical register from the same bank.
• Suggested in previous work [Park et al. MICRO-35, Balasubramonian et al. MICRO-34]

Bypass Skip Implementation
• Need to determine bypassability before the arbitration for register file read ports.
  Pipeline: Issue → Bypass? → Arbitrate → Read/Bypass
  – Problem: extra pipeline stage, possible latency increase
• Optimistic Bypass Hint: [Park et al. '02] Reducing register ports for higher speed and lower energy. MICRO-35.
  – Use wakeup tag search to indicate bypassability.
  – Bypassability indicator is not reset when the source instructions have written back to the register file.
  – Problem: not always correct → could oversubscribe the register file read ports.

Conservative Bypass Skip
• Conservative Bypass Skip Scheme
  – Use wakeup tag search to indicate bypassability.
  – Only avoid read port contention when the value is bypassed from the immediately preceding cycle.
[Figure: ALU and RF with the bypass path.]
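The difference between the optimistic hint and the conservative scheme can be sketched in Python as follows. This is one reading of the bullets, not the authors' logic: the function names are hypothetical, and the model assumes the bypass bit is recomputed from each cycle's wakeup broadcast.

```python
from typing import Iterable, List, Set


def optimistic_update(prev_bits: Iterable[bool], src_tags: List[str],
                      broadcast_tags: Set[str]) -> List[bool]:
    """Optimistic hint: once set by a tag match, the bit is never reset."""
    return [prev or (tag in broadcast_tags)
            for prev, tag in zip(prev_bits, src_tags)]


def conservative_update(src_tags: List[str],
                        broadcast_tags: Set[str]) -> List[bool]:
    """Conservative scheme: the bit is set only by a tag match in this
    cycle's wakeup, so a value produced earlier is read from the RF."""
    return [tag in broadcast_tags for tag in src_tags]


def read_port_requests(src_tags: List[str],
                       bypass_bits: Iterable[bool]) -> List[str]:
    """Sources that must still compete for a register file read port."""
    return [tag for tag, byp in zip(src_tags, bypass_bits) if not byp]
```

In this model a source woken up in the current cycle skips arbitration, but if the instruction issues a cycle later the conservative bit has already dropped and the operand requests a read port again, whereas the optimistic bit would still claim a bypass.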

Bypass Bit Scheme
[Figure: the instruction window from the baseline pipeline (issued bit, ready bit, opcode, src1, src2, dst), with a src bypass bit added next to each src ready bit. The diagram steps through wakeup, wakeup, and select, with tag matches on r9 and r3.]

IPC Comparison (2)
[Figure: bar chart of IPC (axis 0.0 to 2.5) for Baseline, 8B2R2W, and 8B2R2W+B on bzip2, gcc, gzip, twolf, vortex, ijpeg, adpcm, and the average.]
• Our conservative bypass skip scheme improves IPC by 5% on average.
• IPC degradation ranges from <0.1 (9%) to 0.5 (28%), with an average of 0.2 (12%).

Read Sharing
• A local port drives multiple global ports
[Figure: 64x32b 8B2R2W register file (Bank 0 to Bank 7, 8 read and 4 write ports); the left operands (src1) of inst1 to inst4, labeled 2, 0, 6, and 0, enter a 4-way arbitration, and an equality comparator (=) is marked YES.]
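Read sharing can be sketched in Python by letting requests for the same physical register reuse one local port. The `(bank, register)` request format, the issue-order priority, and the function name are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple


def arbitrate_with_sharing(requests: List[Tuple[int, str]],
                           ports_per_bank: int = 2
                           ) -> Tuple[List[int], List[int]]:
    """Arbitrate read ports, sharing a local port among identical reads.

    requests[i] = (bank, physical_register) read by instruction i.
    Returns (granted, conflicts) as lists of instruction indices.
    """
    # bank -> registers already being driven by one of its local ports
    driven: Dict[int, Set[str]] = {}
    granted, conflicts = [], []
    for inst, (bank, reg) in enumerate(requests):
        regs = driven.setdefault(bank, set())
        if reg in regs or len(regs) < ports_per_bank:
            regs.add(reg)            # no-op when the read is shared
            granted.append(inst)
        else:
            conflicts.append(inst)
    return granted, conflicts
```

Under these assumptions, three instructions reading the same register from a two-ported bank consume a single local port, so a fourth read of a different register in that bank still succeeds.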

IPC Comparison (3)
[Figure: bar chart of IPC (axis 0.0 to 2.5) for Baseline, 8B2R2W, 8B2R2W+B, and 8B2R2W+B+S on bzip2, gcc, gzip, twolf, vortex, ijpeg, adpcm, and the average.]
• Adding read sharing improves IPC by another 7% on average.
• IPC degradation is ~0.1 across all the benchmarks, with an average of <0.1 (5%).

Read Sharing Findings
• Why are so many instructions reading the same register?
  1. Groups of load and store instructions that depend on the stack pointer tend to be issued together (procedure call/return points).
  2. Branch instructions that depend on the same register also tend to be issued together.
• Confirms findings in previous work.
  – [Balasubramonian et al. '01] Reducing the complexity of the register file in dynamic superscalar processors. MICRO-34.
  – [Wallace et al. '96] A scalable register file architecture for dynamically scheduled processors. Proc. PACT.

IPC Sensitivity to Configuration
IPC as a percentage of baseline IPC; changes are relative to 8B2R2W+B+S.
Configuration                      IPC     Change
8B2R2W+B+S                         95.1%   (reference)
Conflict-free select, 8B2R2W+B+S   99.2%   +4%
8B2R4W+B+S                         95.4%   +0%
8B2R1W+B+S                         91.2%   -4%
4B2R2W+B+S                         92.3%   -3%
8B2R2W-B+S                         92.1%   -3%
8B2R2W+B-S                         88.4%   -7%
8B2R2W-B-S                         83.2%   -12%
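As a consistency check on the figure, the short Python sketch below recomputes each change as a percentage-point difference from the 8B2R2W+B+S value. The pairing of labels with percentages is a reading of the flattened diagram and should be treated as an assumption.

```python
# IPC as a percentage of baseline IPC (pairing inferred from the slide).
ipc_pct = {
    "conflict-free select 8B2R2W+B+S": 99.2,
    "8B2R4W+B+S": 95.4,
    "8B2R1W+B+S": 91.2,
    "4B2R2W+B+S": 92.3,
    "8B2R2W-B+S": 92.1,
    "8B2R2W+B-S": 88.4,
    "8B2R2W-B-S": 83.2,
}
reference = 95.1  # 8B2R2W+B+S

# Percentage-point change relative to the reference, rounded to integers.
deltas = {cfg: round(value - reference) for cfg, value in ipc_pct.items()}

for cfg, delta in deltas.items():
    print(f"{cfg:34s} {delta:+d}%")
```

The rounded differences reproduce the seven change labels on the slide (+4%, +0%, -4%, -3%, -3%, -7%, -12%), which supports this pairing.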

Register File Characteristics
• Area: Magic, 0.25 µm TSMC CMOS process
• Delay & Energy: HSPICE, 2.5V supply voltage
64x32b, 8 Read Ports & 4 Write Ports
Type     Baseline  8B8R4W  8B2R2W  8B2R1W  8B1R1W
Area     100%      123%    37%     32%     30%
Delay    100%      83%     75%     75%     77%
Energy   100%      61%     59%     58%     41%
[Slide annotations: Packing, Bitline.]
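The savings quoted earlier for the four-issue example follow from the 8B2R2W column of this table. A small Python sketch of the arithmetic, using the table values as given (the dictionary layout and the `reduction` helper are illustrative):

```python
# Values relative to the baseline register file (baseline = 100%).
table = {
    "8B8R4W": {"area": 123, "delay": 83, "energy": 61},
    "8B2R2W": {"area": 37, "delay": 75, "energy": 59},
    "8B2R1W": {"area": 32, "delay": 75, "energy": 58},
    "8B1R1W": {"area": 30, "delay": 77, "energy": 41},
}


def reduction(config: str, metric: str) -> int:
    """Percentage saved relative to the baseline (negative = increase)."""
    return 100 - table[config][metric]


for metric in ("area", "delay", "energy"):
    print(f"8B2R2W {metric} reduction: {reduction('8B2R2W', metric)}%")
```

For 8B2R2W this gives 63% area, 25% delay, and 41% energy reduction, in line with the 63%, 25%, and 40% figures on the Example slide. The unbanked-port 8B8R4W design comes out at -23% for area, that is, 23% larger than the baseline.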

Errata
• Corrected Table 2 Delay
          8r4w      2r2w     2r1w     1r1w
1 bank    100.00%   --       --       --
4 bank    92.38%    79.05%   79.05%   81.90%
8 bank    83.88%    74.76%   74.76%   77.14%
http://www.cag.lcs.mit.edu/scale/

Discussion
• Why Design with Multi-Banked Register File?
  – Reduce Area Dramatically
  – Reduce Access Time → Higher Clock Rate
  – Reduce Energy Consumption
  – Cause Only Slight IPC Degradation
  – Scale With Technology
    • Wire Delay
    • Leakage Power
• Future Work:
  – SMT Architecture
