
Banked Multiported Register Files for High-Frequency Superscalar Microprocessors



  1. Banked Multiported Register Files for High-Frequency Superscalar Microprocessors
     Jessica H. Tseng and Krste Asanović
     MIT Laboratory for Computer Science, Cambridge, MA 02139, USA
     ISCA 2003

  2. Motivation
     • Increasing demand on the number of ports and the number of registers in a register file.
     • Growing concerns about access time, power, and die area.
       – Example: the Alpha 21464 register file (RF) occupied over 5X the area of the 64KB primary data cache (DC).
     [Figure: Alpha 21464 floorplan (ISSCC 2002), marking the 512x64b 16R/8W RF and the 64KB DC]

  3. Distributed Architecture
     • Duplicated
       – Fewer Read Ports
       – Same Number of Write Ports
       – Twice Total Number of Registers
       – Alpha 21264 & Alpha 21464
     • Non-Duplicated
       – Fewer Read Ports
       – Fewer Write Ports
       – Complex Inter-Cluster Communication
     [Figure: two clusters (Cluster 0 and Cluster 1), each with its own RF and ALUs; the non-duplicated variant adds inter-cluster communication]

  4. Centralized Architecture
     • Multi-Level: Register File Cache
       – Fewer Read Ports
       – Fewer Write Ports
       – Control Logic Complexity
       – Poor Locality
     • One-Level Multi-Banked
       – Fewer Read Ports
       – Fewer Write Ports
       – Possible Conflicts
       – Control Logic Complexity
       – Possible Pipeline Stalls
     [Figure: a register file cache between the RF and the ALUs; a banked RF feeding the ALUs]

  5. Previous Work
     • Use a minimal number of ports per register file bank: 1 or 2 read port(s) and 1 write port.
     • Avoid issuing instructions that would cause register file read conflicts.
       – Adds complexity to the critical wakeup-select loop of the issue logic → slower cycle time
     • Resolve register file write conflicts by either delaying physical register allocation until the write-back stage or installing write buffers.
       – Complex pipeline control logic
       – Possible pipeline stalls

  6. Our Work
     • Use more ports per register file bank: 2 read ports and 2 write ports.
     • Speculatively issue potentially conflicting instructions.
       – Minimizes impact on the critical wakeup-select loop of the issue logic
     • Rapidly repair the pipeline and reissue conflicting instructions when conflicts are detected after issue.
       – No write buffer requirement
       – No pipeline stalls
     Simpler and Faster Control Logic

  7. Example
     • Four-issue superscalar machine with a 64x32b 8-banked register file.
       – Area Saving: 63%
       – Access Time Reduction: 25%
       – Energy Reduction: 40%
       – IPC Degradation: < 5%

  8. Outline
     1. Banked Register File Structure
     2. Basic Pipeline Structure and Control Logic
     3. Improving IPC
        • Bypass Skip
        • Read Sharing
     4. Conclusion

  9. Banked Register File Structure
     [Figure: two 64x32b register files, 8B1R1W and 8B2R2W, each built from Banks 0 to 7 and presenting 8 read ports and 4 write ports to the ALUs]

  10. Register File Floorplan
     64x32b, 8 Read Ports & 4 Write Ports
     [Figure: floorplans with the labels storage array, address decoder, bank overhead, and column cell. Area: Baseline 100%, 8B8R4W 123%, 8B2R2W 37%, 8B1R1W 30%]

  11. Baseline Pipeline Structure
     Stages 0 to 6: Fetch, Decode, Rename, Issue, Read/Bypass, Execute, Writeback
     • Issue
       – WAKEUP PHASE: Broadcasts the result tags of issued instructions to update operand readiness.
       – SELECT PHASE: Picks a subset of ready instructions to issue.
     Instruction window (each source carries a src ready bit):
       issued bit  ready bit  opcode  src1 (ready)  src2 (ready)  dst
       0           0          add     r2  (0)       r3  (1)       r9
       1           1          sub     r3  (1)       5   (1)       r7
       0           0          xor     r13 (0)       r17 (0)       r8
       1           1          beq     r3  (1)       r1  (1)       0
       0           1          add     r24 (1)       r3  (1)       r17
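
The wakeup and select phases on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` class, its field names, and the oldest-first selection policy are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    opcode: str
    srcs: dict           # source tag -> src ready bit
    dst: str
    issued: bool = False

    @property
    def ready(self):
        # An entry is ready once every source operand is ready.
        return all(self.srcs.values())

def wakeup(window, result_tags):
    """Broadcast result tags of issued instructions and set matching src ready bits."""
    for entry in window:
        for tag in entry.srcs:
            if tag in result_tags:
                entry.srcs[tag] = True

def select(window, width):
    """Pick up to `width` ready, unissued entries (oldest first, an assumed policy)."""
    picked = [e for e in window if e.ready and not e.issued][:width]
    for entry in picked:
        entry.issued = True
    return picked

window = [
    Entry("add", {"r2": False, "r3": True}, "r9"),
    Entry("sub", {"r3": True}, "r7"),
    Entry("xor", {"r13": False, "r17": False}, "r8"),
]
first = select(window, width=4)    # only `sub` has all sources ready
wakeup(window, {"r2"})             # a producer of r2 broadcasts its tag
second = select(window, width=4)   # `add` becomes ready and is picked
```

Because select depends on the ready bits that wakeup just updated, the two phases form the loop that the slides describe as critical for cycle time.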

  12. Modified Pipeline Structure
     Stages 0 to 7: Fetch, Decode, Rename, Issue, Arbitrate, Read/Bypass, Execute, Writeback
     • Speculatively Issue Potentially Conflicting Instructions: Same Wakeup-Select Loop
     • Additional Arbitration Pipeline Stage
       – Detects read and write bank conflicts when too many instructions try to read from or write to the same register file bank.
       – Muxes operand addresses into available register file ports.
       – Adds a cycle to branch misprediction latency.

  13. N-way Arbitration
     • An N-way superscalar needs only an N-way arbitration for each bank port.
     • Example: 4-way
     [Figure: 64x32b 8B2R2W register file (Banks 0 to 7, 8 read and 4 write ports) with 4-way arbitration over the left operands; src1 of inst1 to inst4 falls in banks 0, 7, 2, and 1]
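
A minimal sketch of read-port conflict detection in the arbitrate stage, assuming the 8B2R2W configuration. The `bank_of` mapping (register number modulo the bank count), the priority order (list order), and the pooling of left and right operands into one count per bank are all assumptions made for illustration; the slides arbitrate per bank port and do not specify the mapping.

```python
from collections import defaultdict

NUM_BANKS = 8    # 8B2R2W configuration from the slides
READ_PORTS = 2   # read ports per bank

def bank_of(preg):
    # Assumed mapping: low-order bits of the physical register number pick the bank.
    return preg % NUM_BANKS

def arbitrate_reads(issue_group):
    """issue_group: list of (inst_id, [source physical registers]) in priority order.

    Returns (granted, conflicting) lists of inst_ids. An instruction conflicts
    when granting it would exceed READ_PORTS on any bank it reads.
    """
    used = defaultdict(int)
    granted, conflicting = [], []
    for inst_id, srcs in issue_group:
        need = defaultdict(int)
        for reg in srcs:
            need[bank_of(reg)] += 1
        if all(used[bank] + n <= READ_PORTS for bank, n in need.items()):
            for bank, n in need.items():
                used[bank] += n
            granted.append(inst_id)
        else:
            conflicting.append(inst_id)
    return granted, conflicting

group = [("i1", [0, 8]), ("i2", [16, 1]), ("i3", [2, 3]), ("i4", [9, 17])]
granted, conflicting = arbitrate_reads(group)
```

In this example `i1` takes both read ports of bank 0, so `i2` (which also needs bank 0) is reported as conflicting and would have to be reissued.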

  14. Pipeline Repair Operation
     [Figure: timing diagram of successive issue groups passing through Wakeup, Select, Issue, Arbitrate, and Read/Bypass. Labels: Conflict Detected; Kill Conflicting Instructions; Kill Following Issue Group; Wakeup Dependents; Clear Ready Bits]

  15. Evaluating IPC Impact
     • IPC degradation simulation: modify the SimpleScalar simulator to keep track of a unified physical register file organized into banks.
       – The shorter access time of banked register files may lead to a higher processor clock rate.
     • Benchmarks: a subset of SPEC2000 and Mediabench benchmarks that covers a range of different IPCs.

  16. IPC Comparison (1)
     [Figure: bar chart of IPC (0.0 to 2.5) for Baseline and 8B2R2W on bzip2, gcc, gzip, twolf, vortex, ijpeg, adpcm, and avg]
     • IPC degradation ranges from 0.1 (9%) to 0.5 (31%), with an average of 0.3 (17%).

  17. Improving IPC
     • Avoid contending for register file read ports whenever possible.
       – Bypass Skip: Operands that will be sourced from the bypass network do not compete for access to the register file.
       – Read Sharing: Allow multiple instructions to read the same physical register from the same bank.
     • Suggested in previous work [Park et al., MICRO-35; Balasubramonian et al., MICRO-34]

  18. Bypass Skip Implementation
     • Need to determine bypassability before the arbitration for register file read ports: Issue → Bypass? → Arbitrate → Read/Bypass
       – Problem: extra pipeline stage, possible latency increase
     • Optimistic Bypass Hint: [Park et al. '02] Reducing register ports for higher speed and lower energy. MICRO-35.
       – Use wakeup tag search to indicate bypassability.
       – Bypassability indicator is not reset when the source instructions have written back to the register file.
       – Problem: not always correct → could oversubscribe the register file read ports.

  19. Conservative Bypass Skip
     • Conservative Bypass Skip Scheme
       – Use wakeup tag search to indicate bypassability.
       – Only avoid read port contention when the value is bypassed from the immediately preceding cycle.
     [Figure: ALU and RF]

  20. Bypass Bit Scheme
     [Figure: the instruction window (issued bit, ready bit, opcode, src1, src2, dst) with a src bypass bit added next to each src ready bit; wakeup tag matches on r9 and r3 update the bits across successive wakeup and select steps]
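
One way to read the conservative scheme and its bypass bit is sketched below. The `Operand` class and both function names are hypothetical; the sketch only assumes what the slides state, namely that a wakeup tag match marks an operand as bypassable and that the mark holds only for a value produced in the immediately preceding cycle.

```python
class Operand:
    def __init__(self, tag, ready=False):
        self.tag = tag
        self.ready = ready     # src ready bit
        self.bypass = False    # src bypass bit

def wakeup_cycle(operands, broadcast_tags):
    """One wakeup cycle: a tag match sets both bits; an older bypass bit is cleared."""
    for op in operands:
        matched = op.tag in broadcast_tags
        if matched:
            op.ready = True
        # Conservative: the bypass bit survives for one cycle only.
        op.bypass = matched

def read_port_requests(operands):
    """Ready operands that must still arbitrate for a register file read port."""
    return [op.tag for op in operands if op.ready and not op.bypass]

ops = [Operand("r3"), Operand("r9"), Operand("r2", ready=True)]
wakeup_cycle(ops, {"r3"})
after_first = read_port_requests(ops)    # r3 is bypassed, so only r2 needs a port
wakeup_cycle(ops, {"r9"})
after_second = read_port_requests(ops)   # r3 lost its bypass bit and now needs a port
```

Clearing the bit after one cycle is what makes the scheme conservative: an operand never skips arbitration unless its value is certain to be on the bypass network, so read ports cannot be oversubscribed by a stale hint.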

  21. IPC Comparison (2)
     [Figure: bar chart of IPC (0.0 to 2.5) for Baseline, 8B2R2W, and 8B2R2W+B on bzip2, gcc, gzip, twolf, vortex, ijpeg, adpcm, and avg]
     • Our conservative bypass skip scheme improves IPC by 5% on average.
     • IPC degradation ranges from <0.1 (9%) to 0.5 (28%), with an average of 0.2 (12%).

  22. Read Sharing
     • A local port drives multiple global ports.
     [Figure: 64x32b 8B2R2W register file (Banks 0 to 7, 8 read and 4 write ports) with 4-way arbitration over the left operands; src1 of inst1 to inst4 falls in banks 2, 0, 6, and 0, and an equality check (=) on the two bank-0 operands returns YES]
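
The effect of read sharing on port demand can be sketched as counting distinct registers per bank instead of operands per bank. The function name `read_ports_needed` and the modulo bank mapping are assumptions for illustration, not details from the slides.

```python
def read_ports_needed(src_regs, num_banks=8, share=True):
    """Return {bank: read ports needed} for one issue group's source registers.

    With sharing, operands naming the same physical register use one local
    port; without it, every operand needs its own port.
    """
    per_bank = {}
    for reg in src_regs:
        # Assumed mapping: register number modulo the bank count.
        per_bank.setdefault(reg % num_banks, []).append(reg)
    return {bank: len(set(regs)) if share else len(regs)
            for bank, regs in per_bank.items()}

# Three instructions read physical register 8 (for example, a stack pointer)
# and one reads register 3.
without_sharing = read_ports_needed([8, 8, 8, 3], share=False)
with_sharing = read_ports_needed([8, 8, 8, 3], share=True)
```

Without sharing, bank 0 would need three read ports and exceed the two available in an 8B2R2W design; with sharing it needs one.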

  23. IPC Comparison (3)
     [Figure: bar chart of IPC (0.0 to 2.5) for Baseline, 8B2R2W, 8B2R2W+B, and 8B2R2W+B+S on bzip2, gcc, gzip, twolf, vortex, ijpeg, adpcm, and avg]
     • Adding read sharing improves IPC by another 7% on average.
     • IPC degradation is ~0.1 across all the benchmarks, with an average of <0.1 (5%).

  24. Read Sharing Findings
     • Why are so many instructions reading the same register?
       1. Groups of load and store instructions that depend on the stack pointer tend to be issued together (procedure call/return points).
       2. Branch instructions that depend on the same register also tend to be issued together.
     • Confirms findings in previous work.
       – [Balasubramonian et al. '01] Reducing the complexity of the register file in dynamic superscalar processors. MICRO-34.
       – [Wallace et al. '96] A scalable register file architecture for dynamically scheduled processors. Proc. PACT.

  25. IPC Sensitivity to Configuration
     Configuration           IPC (% of Baseline IPC)   Change vs. 8B2R2W+B+S
     Conflict-free select    99.2%                     +4%
     8B2R4W+B+S              95.4%                     +0%
     8B2R2W+B+S              95.1%                     (reference)
     4B2R2W+B+S              92.3%                     -3%
     8B2R2W-B+S              92.1%                     -3%
     8B2R1W+B+S              91.2%                     -4%
     8B2R2W+B-S              88.4%                     -7%
     8B2R2W-B-S              83.2%                     -12%

  26. Register File Characteristics
     • Area: Magic, 0.25 µm TSMC CMOS process
     • Delay & Energy: HSPICE, 2.5V supply voltage
     64x32b, 8 Read Ports & 4 Write Ports
       Type     Baseline  8B8R4W  8B2R2W  8B2R1W  8B1R1W
       Area     100%      123%    37%     32%     30%
       Delay    100%      83%     75%     75%     77%
       Energy   100%      61%     59%     58%     41%
     [Figure annotations: Packing, Bitline]
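
As a quick arithmetic check, the savings quoted on slide 7 can be recomputed from the slide 26 figures. The dictionary layout and the `reduction` helper are illustrative only; the numbers are copied from the slide.

```python
# Percentages relative to the baseline register file, copied from slide 26.
characteristics = {
    "Area":   {"Baseline": 100, "8B8R4W": 123, "8B2R2W": 37, "8B2R1W": 32, "8B1R1W": 30},
    "Delay":  {"Baseline": 100, "8B8R4W": 83,  "8B2R2W": 75, "8B2R1W": 75, "8B1R1W": 77},
    "Energy": {"Baseline": 100, "8B8R4W": 61,  "8B2R2W": 59, "8B2R1W": 58, "8B1R1W": 41},
}

def reduction(metric, config):
    """Reduction in percent of the baseline value for the given configuration."""
    row = characteristics[metric]
    return row["Baseline"] - row[config]

for metric in characteristics:
    print(f"{metric}: {reduction(metric, '8B2R2W')}% lower than baseline")
```

For 8B2R2W this gives 63% for area and 25% for delay, matching slide 7. Energy works out to 41%, close to the 40% quoted there.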

  27. Errata
     • Corrected Table 2 (Delay)
                  8r4w      2r2w     2r1w     1r1w
       1 bank     100.00%   --       --       --
       4 bank     92.38%    79.05%   79.05%   81.90%
       8 bank     83.88%    74.76%   74.76%   77.14%
     http://www.cag.lcs.mit.edu/scale/

  28. Discussion
     • Why Design with Multi-Banked Register File?
       – Reduce Area Dramatically
       – Reduce Access Time → Higher Clock Rate
       – Reduce Energy Consumption
       – Cause Only Slight IPC Degradation
       – Scale With Technology
         • Wire Delay
         • Leakage Power
     • Future Work:
       – SMT Architecture
