Regular Distributed Register Fabric and Synthesis for Multi-Cycle Communications
Jason Cong, Yiping Fan, Xun Yang and Zhiru Zhang
{cong, fanyp, yangxun, zhiruz}@cs.ucla.edu
Department of Computer Science, University of California, Los Angeles
Partially supported by NSF under award CCR-0096383, MARCO/DARPA GSRC, and Altera Corp. under the California MICRO program.

Outline
• Needs for Multi-Cycle On-Chip Communication
• Contributions
  • Regular Distributed Register (RDR) Architecture
  • MCAS: Architectural Synthesis for Multi-Cycle Communication
    • Scheduling-driven placement
    • Placement-driven rescheduling & rebinding
• Experimental Results
• Conclusions & Future Work

Needs for Multi-Cycle On-Chip Communication
• Interconnect delays dominate the timing in DSM technologies
• Single-cycle full-chip synchronization is no longer possible
• NTRS'97 0.07 um technology
  • 5 GHz across-chip clock
  • 620 mm^2 (24.9 mm x 24.9 mm)
  • IPEM BIWS estimations
  • Buffer size: 100x
  • Driver/receiver size: 100x
  • From corner to corner: 7 clock cycles
[Figure: chip map showing the regions reachable in 1 to 7 clock cycles; axis marked at 0, 7.52, 15.04, 22.56 and 24.9 mm]
Source: J. Cong, "Timing Closure Based on Physical Hierarchy," ISPD'02.

Multi-Cycle Interconnect Communication at Logic / Physical Level
• Simultaneous retiming + placement / floorplanning
  • [Cong et al, ICCAD'00] [Cong et al, DAC'03]
  • [Chong & Brayton, IWLS'01]
  • [Singh & Brown, FPGA'02]
• Limitation:
  • The minimum clock period that can be achieved by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit
    • Example: a loop with 4 logic cells and 2 registers
    • Cell delay = 1 ns each
    • Interconnect delay = 4 ns each
    • DR ratio = (D_logic + D_int) / #Registers = (4 + 16) / 2 = 10 ns
    • Clock cycle >= 10 ns
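The DR-ratio bound above can be checked with a few lines of Python. This is only a sketch: the function names `dr_ratio` and `min_clock_period` are illustrative and not from the slides, and the loop is assumed to have four connections of 4 ns each, which is what the 16 ns total on the slide implies.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay / register count."""
    if num_registers <= 0:
        raise ValueError("a loop must contain at least one register")
    total_delay = sum(logic_delays_ns) + sum(wire_delays_ns)
    return total_delay / num_registers


def min_clock_period(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)


# Loop from the slide: 4 cells at 1 ns, 4 connections at 4 ns (assumed), 2 registers.
slide_loop = ([1.0] * 4, [4.0] * 4, 2)

print(dr_ratio(*slide_loop))  # (4 + 16) / 2 = 10.0 ns
```

Because the bound is a maximum over loops, a single long-wire loop fixes the clock period no matter how well the rest of the circuit is retimed.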

Our Contributions
• Regular Distributed Register (RDR) micro-architecture
  • Highly regular
  • Direct support of multi-cycle on-chip communication
• MCAS: Architectural Synthesis for Multi-cycle Communication
  • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
  • Targets the RDR architecture

Regular Distributed Register Architecture (1)
[Figure: regular array of islands joined by global interconnect. Each island has width W_i and height H_i and contains a register file, an FSM, and a local computational cluster (LCC) with an area constraint; the LCC shown holds MUL, ADD and MUX units.]
• Distribute registers to each "island"
• Choose the island size such that local computation and communication in each island can be done in a single cycle:

$$D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{opt-int}} \le D_{\text{logic}} + 2\,D_{\text{opt-int}}(W_i + H_i) \le T$$

Regular Distributed Register Architecture (2)
[Figure: the same island array, with the register file of each island split into banks labelled 1 cycle, 2 cycle, ..., k cycle.]
• Use register banks:
  • Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
• Highly regular

Example: Regular Distributed Register Architecture for 70nm Technology
• NTRS'97 70 nm technology
• Chip dimension: 620 mm^2 (24.9 mm x 24.9 mm)
• 5 GHz across-chip clock
  • A signal can travel up to 7.52 mm within 1 clock cycle under best interconnect optimization
  • 7 clock cycles are needed to cross the chip
• Each island's base dimension
  • W_i = H_i = 2.08 mm
  • ≈ 1/3 of the distance a wire can travel in 1 clock cycle
  • Logic volume: 6.76M min-size 2-NAND gates
• 12 x 12 array of islands
• Local registers are partitioned into 7 banks
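The figures on this slide can be related to one another with a short calculation. The sketch below is not the authors' model. It assumes that corner-to-corner distance is Manhattan distance, that the cycle count is the distance divided by the 7.52 mm single-cycle reach, rounded up, and that island-to-island distance is measured between island centers. All function names are hypothetical.

```python
import math

CHIP_SIDE_MM = 24.9         # chip is 24.9 mm x 24.9 mm
REACH_PER_CYCLE_MM = 7.52   # distance covered in one 5 GHz clock cycle
ISLAND_SIDE_MM = 2.08       # W_i = H_i


def islands_per_side(chip_side_mm, island_side_mm):
    # 24.9 / 2.08 is about 11.97, so the slide's figures are rounded;
    # rounding to the nearest integer gives the 12 x 12 array.
    return round(chip_side_mm / island_side_mm)


def cycles_for_distance(distance_mm, reach_mm):
    # Assumption: one cycle per started 'reach' of wire, at least one cycle.
    return max(1, math.ceil(distance_mm / reach_mm))


def corner_to_corner_cycles(chip_side_mm, reach_mm):
    # Assumption: wires are routed rectilinearly (Manhattan distance).
    return cycles_for_distance(2 * chip_side_mm, reach_mm)


def cycles_between_islands(a, b):
    # Hypothetical helper: (row, col) island coordinates, center-to-center
    # Manhattan distance; the result would select the register bank.
    hops = abs(a[0] - b[0]) + abs(a[1] - b[1])
    return cycles_for_distance(hops * ISLAND_SIDE_MM, REACH_PER_CYCLE_MM)


print(islands_per_side(CHIP_SIDE_MM, ISLAND_SIDE_MM))                  # 12
print(corner_to_corner_cycles(CHIP_SIDE_MM, REACH_PER_CYCLE_MM))       # 7
print(cycles_between_islands((0, 0), (11, 11)))                        # 7
```

Under these assumptions the longest transfer takes 7 cycles, which agrees with the 7 register banks quoted on the slide.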

RDR Architecture vs. DRA
• Distributed Register File Architecture (DRA)
  • Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD'01]
• Similarities:
  • Distributes registers near the local computational units
  • Supports multi-cycle communication
  • Allows concurrent computation and communication
• Distinction:
  • The RDR architecture is highly regular
    • Facilitates interconnect delay estimation
    • Enables the systematic exploration of the cycle-time/latency tradeoff by varying the size of the basic island

Example: Impact of Interconnect on Scheduling
• Data flow graph extracted from the discrete cosine transform (DCT)
• Nodes with the same color are assigned to the same functional unit

resource     delay   num
multiplier   2 ns    2
alu          1 ns    2

Binding of DFG nodes to functional units:
• Alu1: 1, 5, 10
• Alu2: 2, 6, 9
• Mul1: 4, 8, 11
• Mul2: 3, 7, 12

[Figure: 12-node data flow graph (subtractions 1, 5, 6, 9, 10; addition 2; multiplications 3, 4, 7, 8, 11, 12) and its performance-driven placement on four LCCs, each with its own register file; short and long interconnects are annotated with 1 ns and 2 ns delays.]

Single-cycle vs. Multi-cycle Interconnect Communication
[Figure: the two schedules of the 12-node DFG side by side, Cycle1 to Cycle6 for single-cycle communication and Cycle1 to Cycle9 for multi-cycle communication; a marker on the edges represents registers.]
• Single-cycle interconnect communication
  • Scheduled in 6 clock cycles
  • Clock period is 4 ns
  • Total latency is 24 ns
• Multi-cycle interconnect communication
  • Scheduled in 9 clock cycles
  • Clock period is 2 ns
  • Total latency is 18 ns

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
[Figure: schedule of the DFG over Cycle1 to Cycle8 next to the placement of the four LCCs. Binding: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 11; Mul2: 3, 7, 12.]
• With placement integrated with scheduling, the critical path is reduced.
• The DFG can be scheduled in 8 clock cycles, with a clock period of 2 ns.
• The total latency is 16 ns.

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
[Figure: schedule of the DFG over Cycle1 to Cycle7 next to the placement of the four LCCs. Binding: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 12; Mul2: 3, 7, 11.]
• With placement integrated with scheduling and binding, the critical path is further reduced.
• The DFG can be scheduled in 7 clock cycles, with a clock period of 2 ns.
• The total latency is 14 ns.
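The latency figures quoted for the DCT example all follow from latency = clock cycles x clock period. The sketch below tabulates the four schedules with the numbers given on the slides; the `Schedule` class and the schedule labels are illustrative names, not part of the original material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    name: str
    cycles: int             # number of clock cycles in the schedule
    clock_period_ns: float  # clock period in nanoseconds

    @property
    def latency_ns(self) -> float:
        # Total latency is the cycle count times the clock period.
        return self.cycles * self.clock_period_ns


# Cycle counts and clock periods as stated on the slides.
SCHEDULES = [
    Schedule("single-cycle communication", 6, 4.0),
    Schedule("multi-cycle communication", 9, 2.0),
    Schedule("placement + scheduling", 8, 2.0),
    Schedule("placement + scheduling + binding", 7, 2.0),
]

for s in SCHEDULES:
    print(f"{s.name:34s} {s.cycles} x {s.clock_period_ns} ns = {s.latency_ns} ns")

best = min(SCHEDULES, key=lambda s: s.latency_ns)
print("lowest latency:", best.name)
```

The multi-cycle schedules use more cycles than the single-cycle one, but the halved clock period more than compensates, and each enhancement removes one further cycle.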
