
Regular Distributed Register Fabric and Synthesis for Multi-Cycle Communications

  1. Regular Distributed Register Fabric and Synthesis for Multi-Cycle Communications
     Jason Cong, Yiping Fan, Xun Yang and Zhiru Zhang
     {cong, fanyp, yangxun, zhiruz}@cs.ucla.edu
     Department of Computer Science, University of California, Los Angeles
     Partially supported by NSF under award CCR-0096383, MARCO/DARPA GSRC, and Altera Corp. under the California MICRO program.

  2. Outline
     • Needs for Multi-Cycle On-Chip Communication
     • Contributions
       • Regular Distributed Register (RDR) Architecture
       • MCAS: Architectural Synthesis for Multi-Cycle Communication
         • Scheduling-driven placement
         • Placement-driven rescheduling & rebinding
     • Experimental Results
     • Conclusions & Future Work

  3. Needs for Multi-Cycle On-Chip Communication
     • Interconnect delays dominate the timing in DSM tech.
       • Single-cycle full chip synchronization is no longer possible
     • NTRS'97 0.07um Tech
     • 5 GHz across-chip clock
     • 620 mm^2 (24.9mm x 24.9mm)
     • IPEM BIWS estimations
       • Buffer size: 100x
       • Driver/receiver size: 100x
     • From corner to corner: 7 clock cycles
     [Figure: chip area divided into zones reachable in 1 to 7 clock cycles; axis marks at 0, 7.52, 15.04, 22.56 and 24.9 mm.]
     Source: J. Cong, "Timing Closure Based on Physical Hierarchy," ISPD'02.

  4. Multi-Cycle Interconnect Communication at Logic / Physical Level
     • Simultaneous retiming + placement / floorplanning
       • [Cong et al, ICCAD'00] [Cong et al, DAC'03]
       • [Chong & Brayton, IWLS'01]
       • [Singh & Brown, FPGA'02]
     • Limitation:
       • The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit
         • Example: a loop with 4 logic cells and 2 registers
         • Cell delay = 1ns
         • Interconnect delay = 4ns
         • DR ratio = (D_logic + D_int) / #Registers = (4 + 16) / 2 = 10ns
         • Clock cycle >= 10ns
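The DR-ratio bound above can be checked with a few lines of arithmetic. This is only a sketch: the function names `dr_ratio` and `min_clock_period` are made up for illustration, and the second loop in the usage example is hypothetical. The slide's loop contributes 4ns of logic delay (4 cells at 1ns) and 16ns of total interconnect delay.

```python
def dr_ratio(logic_delay_ns, interconnect_delay_ns, num_registers):
    """Delay-to-register ratio of one loop: total loop delay per register."""
    if num_registers <= 0:
        raise ValueError("a loop must contain at least one register")
    return (logic_delay_ns + interconnect_delay_ns) / num_registers


def min_clock_period(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)


# Loop from the slide: (D_logic, D_int, #registers) = (4ns, 16ns, 2).
slide_loop = (4 * 1.0, 16.0, 2)

# Hypothetical second loop, added only to show that the maximum is taken.
other_loop = (3.0, 6.0, 1)

print(dr_ratio(*slide_loop))                       # 10.0
print(min_clock_period([slide_loop, other_loop]))  # 10.0
```

Because retiming can only move registers around a loop and cannot change their number, no amount of logic-level optimization brings the clock period of this loop below 10ns.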

  5. Our Contributions
     • Regular Distributed Register (RDR) micro-architecture
       • Highly regular
       • Direct support of multi-cycle on-chip communication
     • MCAS: Architectural Synthesis for Multi-cycle Communication
       • Integrated architectural synthesis (e.g. resource binding, scheduling) with physical planning
       • Targets the RDR architecture

  6. Regular Distributed Register Architecture (1)
     [Figure: array of islands, each of width W_i and height H_i, connected by global interconnect. Each island holds a register file, an FSM and a Local Computational Cluster (LCC) with units such as MUL, ADD and MUX; the cluster is subject to an area constraint.]
     • Distribute registers to each "island"
     • Choose the island size such that local computation and communication in each island can be done in a single cycle:
       $D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{opt-int}} \le D_{\text{logic}} + 2\,D_{\text{opt-int}}(W_i + H_i) \le T$

  7. Regular Distributed Register Architecture (2)
     [Figure: same island array as on the previous slide, with the register file of each island split into banks labeled 1 cycle, 2 cycle, ..., k cycle.]
     • Use register banks:
       • Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
     • Highly regular

  8. Example: Regular Distributed Register Architecture for 70nm Technology
     • NTRS'97 70nm Tech
     • Chip dimension: 620 mm^2 (24.9mm x 24.9mm)
     • 5 GHz across-chip clock
       • Can travel up to 7.52mm within 1 clock cycle under best interconnect optimization
       • Need 7 clock cycles to cross the chip
     • Each island base dimension
       • W_i = H_i = 2.08mm
       • ≈ 1/3 of distance a wire can travel in 1 clock cycle
       • Logic volume: 6.76M min-size 2-NAND gates
     • 12x12 array of islands
     • Local registers are partitioned to 7 banks
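The figures in this example can be reproduced with simple arithmetic. The sketch below assumes that wires follow Manhattan (rectilinear) routes, so the corner-to-corner distance is twice the chip side. The slides do not state this derivation; it is an assumption that happens to reproduce the quoted 7 cycles. The function names are illustrative.

```python
import math

CHIP_SIDE_MM = 24.9        # chip is 24.9mm x 24.9mm
REACH_PER_CYCLE_MM = 7.52  # distance a signal travels in one clock cycle
ISLAND_SIDE_MM = 2.08      # W_i = H_i


def islands_per_side(chip_side_mm, island_side_mm):
    """Number of islands needed to tile one side of the chip."""
    return math.ceil(chip_side_mm / island_side_mm)


def corner_to_corner_cycles(chip_side_mm, reach_per_cycle_mm):
    """Clock cycles to cross the chip diagonally.

    Assumption (not stated on the slide): Manhattan routing, so the
    corner-to-corner wire length is 2 * chip_side_mm.
    """
    return math.ceil(2 * chip_side_mm / reach_per_cycle_mm)


print(islands_per_side(CHIP_SIDE_MM, ISLAND_SIDE_MM))             # 12
print(corner_to_corner_cycles(CHIP_SIDE_MM, REACH_PER_CYCLE_MM))  # 7
```

Under this reading, the 7 register banks per island match the worst-case communication distance of 7 cycles.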

  9. RDR Architecture vs. DRA
     • Distributed Register File Architecture (DRA)
       • Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD'01]
     • Similarities:
       • Distribute registers near the local computational units
       • Supports multi-cycle communication
       • Allows concurrent computation and communication
     • Distinction:
       • The RDR architecture is highly regular
         • Facilitates interconnect delay estimation
         • Enables the systematic exploration of cycle-time/latency tradeoff by varying the size of the basic island

  10. Example: Impact of Interconnect on Scheduling
     • Data flow graph extracted from discrete cosine transformation (DCT)
     • The nodes with the same color are assigned to the same functional unit.

     resource     delay   num
     multiplier   2 ns    2
     alu          1 ns    2

     functional unit   DFG nodes
     Alu1              1, 5, 10
     Alu2              2, 6, 9
     Mul1              4, 8, 11
     Mul2              3, 7, 12

     [Figure: 12-node DFG (nodes 1, 5, 6, 9, 10 are "-"; node 2 is "+"; nodes 3, 4, 7, 8, 11, 12 are "*") next to a performance-driven placement of the four functional units, each in an island with its own register file. Long and short interconnects are marked, with delay labels of 1 ns and 2 ns.]
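The binding on this slide can be written down as plain data and sanity-checked. The sketch below is illustrative only: the node operators are as read from the slide figure, and the rule that an ALU executes "+" and "-" while a multiplier executes "*" is an assumption implied by the binding rather than stated in the slides. The names `NODE_OPS`, `UNIT_OPS` and `check_binding` are hypothetical.

```python
# Operator of each DFG node, as read from the slide figure.
NODE_OPS = {1: "-", 2: "+", 3: "*", 4: "*", 5: "-", 6: "-",
            7: "*", 8: "*", 9: "-", 10: "-", 11: "*", 12: "*"}

# Assumption: which operators each kind of functional unit can execute.
UNIT_OPS = {"Alu": {"+", "-"}, "Mul": {"*"}}

# Binding of DFG nodes to functional units from the slide.
BINDING = {"Alu1": [1, 5, 10], "Alu2": [2, 6, 9],
           "Mul1": [4, 8, 11], "Mul2": [3, 7, 12]}


def unit_kind(unit_name):
    """Strip the instance number: 'Alu1' -> 'Alu'."""
    return unit_name.rstrip("0123456789")


def check_binding(binding):
    """Return a list of problems; an empty list means the binding is legal."""
    problems = []
    owner = {}
    for unit, nodes in binding.items():
        allowed = UNIT_OPS[unit_kind(unit)]
        for node in nodes:
            if node in owner:
                problems.append(f"node {node} bound to both {owner[node]} and {unit}")
            owner[node] = unit
            if NODE_OPS[node] not in allowed:
                problems.append(f"node {node} ({NODE_OPS[node]}) cannot run on {unit}")
    for node in sorted(set(NODE_OPS) - set(owner)):
        problems.append(f"node {node} is unbound")
    return problems


print(check_binding(BINDING))  # []
```

A check like this is cheap to rerun whenever a synthesis step changes the binding.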

  11. Single-cycle vs. Multi-cycle Interconnect Communication
     [Figure: two schedules of the 12-node DFG side by side, with registers marked between cycles. The single-cycle schedule spans Cycle1 to Cycle6; the multi-cycle schedule spans Cycle1 to Cycle9.]
     • Single-cycle interconnect communication
       • Scheduled in 6 clock cycles
       • Clock period is 4ns
       • Total latency is 24ns
     • Multi-cycle interconnect communication
       • Scheduled in 9 clock cycles
       • Clock period is 2ns
       • Total latency is 18ns

  12. Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
     [Figure: placement of Mul2 (nodes 3, 7, 12), Alu1 (1, 5, 10), Alu2 (2, 6, 9) and Mul1 (4, 8, 11), each in an island with its own register file, next to a schedule spanning Cycle1 to Cycle8.]
     • With placement integrated with scheduling, the critical path is reduced.
     • The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns.
     • The total latency is 16ns.

  13. Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
     [Figure: placement of Mul2 (nodes 3, 7, 11), Alu1 (1, 5, 10), Mul1 (4, 8, 12) and Alu2 (2, 6, 9), each in an island with its own register file, next to a schedule spanning Cycle1 to Cycle7.]
     • With placement integrated with scheduling and binding, the critical path is further reduced.
     • The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns.
     • The total latency is 14ns.
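The latencies quoted on slides 11 to 13 follow from total latency = clock cycles x clock period. The short sketch below tabulates them; the percentage reductions relative to the single-cycle baseline are derived here and do not appear in the slides. The names `SCHEDULES`, `total_latency_ns` and `latency_report` are illustrative.

```python
# (description, clock cycles, clock period in ns), as given on slides 11-13.
SCHEDULES = [
    ("single-cycle communication", 6, 4),
    ("multi-cycle communication", 9, 2),
    ("simultaneous placement and scheduling", 8, 2),
    ("simultaneous placement, scheduling and binding", 7, 2),
]


def total_latency_ns(cycles, clock_period_ns):
    """Total latency of a schedule: number of cycles times the clock period."""
    return cycles * clock_period_ns


def latency_report(schedules):
    """Return (description, latency_ns, percent reduction vs. first entry)."""
    baseline = total_latency_ns(schedules[0][1], schedules[0][2])
    rows = []
    for name, cycles, period in schedules:
        latency = total_latency_ns(cycles, period)
        reduction = 100.0 * (baseline - latency) / baseline
        rows.append((name, latency, reduction))
    return rows


for name, latency, reduction in latency_report(SCHEDULES):
    print(f"{name}: {latency}ns ({reduction:.1f}% below baseline)")
```

The multi-cycle schedule uses more cycles than the single-cycle one, but its halved clock period more than compensates, and the two enhancements then recover cycles at the same 2ns period.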
