Optimised Synthesis of Asynchronous Elastic Dataflows by Leveraging Clocked EDA
Mahdi Jelodari Mamaghani, Jim Garside, Will Toms & Doug Edwards
Verona, Italy, 29th August 2014
Motivation: Automatic GALSification of Control-driven Systems
- Partitioning a control-driven system at the behavioural level is complicated
- Detecting signal correspondence between the data path and the control path is error-prone
- The presence of global control impedes the detection process [1,2]
- Balsa, Petrify, VeriSyn & AVS: popular control-driven synthesis tools
- GCD, a data-dependent loop, as an example:
[GCD Control Path] [GCD Data Path]
The Advanced Processor Technologies Research Group
Motivation: Automatic GALSification of Data-driven Systems
Concurrent dataflow specification of a system
Fine-grained dataflow synthesis
Automatic partitioning of the system into multiple clocked islands
Teak: Asynchronous Dataflow Backend for Balsa Language
Released in 2010 as a dataflow syntax-directed synthesis backend for the Balsa language [ACSD’09].
Some of the properties of Teak Dataflow Networks (TDNs):
- Communication: point-to-point communication between computation blocks. Slack elastic channels are capable of storing ‘any number’ of tokens.
- Computational: macro-module style with separate Go and Done activation signals. These modules are chained in sequence or in parallel according to the source-level directives.
Dataflow which realises data-dependent computation.
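The slack elastic property of the channels can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not Teak's implementation: the `Channel` class, the `run_pipeline` function and the `consumer_period` parameter are hypothetical names introduced here. It shows that changing the number of token places on a link changes timing but not the stream of values delivered.

```python
from collections import deque


class Channel:
    """Hypothetical point-to-point FIFO link with `capacity` token places."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = deque()

    def can_put(self):
        return len(self.tokens) < self.capacity

    def can_get(self):
        return len(self.tokens) > 0


def run_pipeline(values, capacity, consumer_period):
    """Send `values` through one channel to a slow consumer.

    The consumer only accepts a token every `consumer_period` steps.
    Returns (received_values, steps_taken).
    """
    chan = Channel(capacity)
    pending = deque(values)
    received = []
    step = 0
    while len(received) < len(values):
        # Consumer side: take a token when ready and one is available.
        if step % consumer_period == 0 and chan.can_get():
            received.append(chan.tokens.popleft())
        # Producer side: stalls (back-pressure) while the channel is full.
        if pending and chan.can_put():
            chan.tokens.append(pending.popleft())
        step += 1
    return received, step
```

Running `run_pipeline` with the same values and different `capacity` settings returns the same `received` list each time; only the step count may differ.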
Teak: Behavioural Synthesis Flow
Syntax Directed
Teak Model of Computation: Macro-modules
- Single-input, single-output macro-modules connected by buffers
- Control and data move along through macro-modules
- Teak uses three hierarchical primitives to form a dataflow network: Sequential, SteerMerge and Iterative
Protocol: Conventional Synchronous vs. Elastic
- In conventional synchronous systems: latency = 0; timing alignment by inserting buffers in the post-synthesis stage
- In elastic systems: latency can vary; the system tolerates variations in latencies through handshaking
- In synchronous elastic systems: latency is discretised by the clock; a common timing discipline is introduced to the handshake system
[Synchronous] [Asynchronous] [Synchronous Elastic]
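A cycle-by-cycle model makes the "latency is discretised by the clock" point concrete. The sketch below assumes a single channel with a valid/stop handshake in which each cycle is a Transfer (valid, no stop), a Retry (valid and stop) or an Idle cycle (not valid); the state names follow the usual description of SELF [3], while `self_channel`, `sender_ready` and `receiver_stop` are hypothetical names introduced here.

```python
def self_channel(data, sender_ready, receiver_stop):
    """Trace one elastic channel over len(sender_ready) clock cycles.

    data:          tokens to send, in order
    sender_ready:  per-cycle flag, True if the sender offers a new token
    receiver_stop: per-cycle flag, True if the receiver asserts stop
    Returns a list of (cycle, state, token_or_None).
    """
    trace = []
    index = 0          # next token to transfer
    retrying = False   # a stopped token must be offered again
    for cycle, (ready, stop) in enumerate(zip(sender_ready, receiver_stop)):
        valid = index < len(data) and (ready or retrying)
        if not valid:
            trace.append((cycle, "Idle", None))
        elif stop:
            retrying = True   # persistence: keep offering the same token
            trace.append((cycle, "Retry", data[index]))
        else:
            retrying = False
            trace.append((cycle, "Transfer", data[index]))
            index += 1
    return trace


def transfers(trace):
    """Keep only the (cycle, token) pairs where a transfer happened."""
    return [(c, tok) for c, state, tok in trace if state == "Transfer"]
```

Tokens always arrive in order, and every delay introduced by a stop or an idle sender is a whole number of clock cycles.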
A Common Timing Discipline for Asynchronous Dataflow Networks of Teak
The Synchronous Elastic* protocol is incorporated into the Teak flow as a common timing discipline:
- Deterministic behaviour (bounded delays)
- The deadlock issue in the network is simplified
- Smaller circuit area (~4 times smaller)
- Slack elasticity is still preserved (any amount of storage on links)
- Improved power efficiency (clock gating + simple handshake)
SDF
*Synchronous Elastic Flow (SELF) [3]
CSP Networks [non-Deterministic] Kahn Process Networks [Deterministic]
Correctness in Asynchronous Dataflow Networks of Teak
- Variables in dataflow networks: single write / multiple read
- A variable provides a place for data tokens, so more than 2 latches are needed to ensure deadlock freedom
Correctness in SELF Adapted Networks
- Variables in eTeak: elastic controllers with a pair of latches operating at opposite clock phases
- Operations: a write takes 1 cycle and a read takes 0 cycles
- Each variable provides two places for data tokens
- Loops with write/read operations do not need extra latches
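The timing of such a variable can be sketched with a two-latch model. This is an illustrative assumption about how the latch pair maps onto the stated timing, not eTeak's actual controller: `ElasticVariable`, `tick` and `count_up` are hypothetical names. A write becomes visible after one clock tick, a read returns immediately, and a read-modify-write loop runs without any additional storage.

```python
class ElasticVariable:
    """Hypothetical two-place variable: write takes 1 cycle, read takes 0."""

    def __init__(self, initial=None):
        self.master = initial  # latch assumed transparent in the first phase
        self.slave = initial   # latch assumed transparent in the opposite phase

    def write(self, value):
        # Captured by the first latch; not yet visible to readers.
        self.master = value

    def tick(self):
        # Opposite phase: the written value moves to the readable place.
        self.slave = self.master

    def read(self):
        # Zero-cycle read of the currently visible value.
        return self.slave


def count_up(n):
    """Loop that reads and writes the same variable once per cycle."""
    v = ElasticVariable(0)
    for _ in range(n):
        v.write(v.read() + 1)  # read old value and write new one, same cycle
        v.tick()
    return v.read()
```

Because the old value stays readable while the new one is being captured, the loop body never overwrites data it still needs.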
Synchronous Crystallisation and Re-synthesis
- Synchronous Crystallisation: regional transformation of a dataflow into a synchronous control-driven circuit through re-synthesis
- The candidates for Crystallisation are selected based on their physical characteristics (e.g. critical path)
- Synthesis at system level enables us to rapidly explore different trade-offs between power, performance and area
Crystallisation: Through RTL Transformation
By extracting the occurrence graph and detecting concurrent dataflows within the Teak network.
What we achieve by this transformation:
- Locally synchronous: deterministic behaviour, reduced fine-grained communication overhead
- Easier modelling and partitioning towards GALSification
- Use the power of clocked EDA to re-synthesise
- Pipelined structures: better throughput
Elastic to RTL Transformation: The Algorithm
Case A: when Root is a Fork and MM1 / MM2 are independent:
    always @ (posedge CLK) : FSM_A1
        Out_1 <= φ1 (A, B)
    always @ (posedge CLK) : FSM_A2
        Out_2 <= φ2 (A, B)
    assign Out = Join (Out_1, Out_2)
[Figure: Root (inputs A, B), MM1 (φ1), MM2 (φ2), Sink (Out)]
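A behavioural model of Case A, as a minimal sketch: the two independent macro-modules become two registers that update on the same clock edge. The function name `case_a` is hypothetical, Join is modelled simply as pairing the two outputs, and each input pair is assumed to be consumed in one cycle.

```python
def case_a(inputs, phi1, phi2):
    """One clock edge per (a, b) pair; both registers update together."""
    out_1 = out_2 = None
    trace = []
    for a, b in inputs:
        # Non-blocking style: compute both next values, then commit them.
        next_1 = phi1(a, b)   # FSM_A1
        next_2 = phi2(a, b)   # FSM_A2
        out_1, out_2 = next_1, next_2
        trace.append((out_1, out_2))  # Join modelled as a pair (assumption)
    return trace
```

For example, `case_a([(2, 3)], lambda a, b: a + b, lambda a, b: a * b)` yields one joined result per cycle.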
Elastic to RTL Transformation: The Algorithm
Case B: when Root is a Fork and MM1 / MM2 are dependent:
    always @ (posedge CLK) : FSM_B
        State1: Out_temp <= φ1 (A, B)
        State2: Out_2 <= φ2 (A, B, Out_temp)
    assign Out = Out_2
[Figure: Root (inputs A, B), MM1 (φ1), MM2 (φ2), Sink (Out)]
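Case B can be modelled as a two-state machine in which the second macro-module consumes the first one's result. This is a sketch under assumptions: `case_b` is a hypothetical name, and the inputs A and B are assumed to stay stable across both states.

```python
def case_b(inputs, phi1, phi2):
    """Two clock cycles per (a, b) pair: State1 then State2.

    Returns (outputs, cycles_used).
    """
    outputs = []
    cycles = 0
    for a, b in inputs:
        # State1: first macro-module writes the temporary register.
        out_temp = phi1(a, b)
        cycles += 1
        # State2: second macro-module uses the temporary result.
        out_2 = phi2(a, b, out_temp)
        cycles += 1
        outputs.append(out_2)  # assign Out = Out_2
    return outputs, cycles
```

Compared with the independent case, the dependency serialises the two operations, so each result costs two cycles in this model.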
Elastic to RTL Transformation: The Algorithm
Case C: when Root is a Splitter/Steer:
    always @ (posedge CLK) : FSM_C
        State_Root: Case (A, B)
            1: State1
            2: State2
        State1: Out_1 <= φ1 (A, B)
        State2: Out_2 <= φ2 (A, B)
    assign Out = Merge (Out_1, Out_2)
[Figure: Root (inputs A, B), MM1 (φ1), MM2 (φ2), Sink (Out)]
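Case C can be sketched as a state machine whose root state steers each input to exactly one branch. The names `case_c` and `select` are hypothetical; `select` stands in for the Case (A, B) decision and must return 1 or 2, and Merge is modelled as forwarding whichever branch fired.

```python
def case_c(inputs, select, phi1, phi2):
    """Root state picks a branch, then that branch computes the output.

    Returns (outputs, cycles_used).
    """
    outputs = []
    cycles = 0
    for a, b in inputs:
        # State_Root: decide which branch handles this input pair.
        branch = select(a, b)
        cycles += 1
        if branch == 1:
            out = phi1(a, b)   # State1: Out_1 <= phi1(A, B)
        elif branch == 2:
            out = phi2(a, b)   # State2: Out_2 <= phi2(A, B)
        else:
            raise ValueError("select must return 1 or 2")
        cycles += 1
        outputs.append(out)    # Merge: only one branch produces a value
    return outputs, cycles
```

Only one of the two branches is active for any given input, which is why a single merged output is sufficient.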
RTL Transformation for the Shifter Example
In this example, the Root within the macro-modules is a Splitter (Case C) whilst the macro-modules are dependent (Case B); therefore the whole structure is transformed into a single FSM.
eTeak Snapshot: Visual Crystallised Partitions
Async. vs. Sync. Elastic: Area Cost
- Case study: SSEM, a three-stage iterative processor implemented in Balsa
- Deadlock-free design: Async. (65 buffers) vs. Sync. Elastic (6 buffers)
- The slack elastic property is preserved
[Figure: bar chart of area cost for the Asynchronous and Synchronous Elastic designs, broken down into F-J-M-S, Variables, Subtracter and Latch]
Asynchronous vs. Synchronous Elastic SSEM
- Application: GCD (67, 2): 250 instructions
- Slack matching can potentially improve the performance by a factor of 3
[Figure: chart of Area and 1/Throughput for the designs listed below]

Design | Total Cell Area (k*mm^2) | Exec. Time (10*ms)
Asynchronous* | 68.41 | 40.61
Synchronous Elastic* (f = 1.250 GHz) | 47.447 | 147.47
Asynchronous | 56.183 | 46.5
Synchronous Elastic (f = 435 MHz) | 12.563 | 62.04
Solid Synchronous (f = 1.1 GHz) | 7.723 | 16.438

*Fully buffered to verify the slack elastic property
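The reported figures can be turned into ratios with a few lines of arithmetic. The numbers below are copied from the results slide; the dictionary layout and the `ratio` helper are hypothetical names introduced for this sketch, and the units are left as given on the slide.

```python
# (total cell area, execution time) per design, as reported on the slide.
RESULTS = {
    "Asynchronous*": (68.41, 40.61),
    "Synchronous Elastic*": (47.447, 147.47),
    "Asynchronous": (56.183, 46.5),
    "Synchronous Elastic": (12.563, 62.04),
    "Solid Synchronous": (7.723, 16.438),
}

AREA, TIME = 0, 1


def ratio(design_a, design_b, column):
    """Return design_a's figure divided by design_b's for one column."""
    return RESULTS[design_a][column] / RESULTS[design_b][column]


# Area saving of the synchronous elastic design over the asynchronous one.
area_saving = ratio("Asynchronous", "Synchronous Elastic", AREA)
# Execution-time cost of the synchronous elastic design.
time_cost = ratio("Synchronous Elastic", "Asynchronous", TIME)
```

With these figures the asynchronous design is about 4.5 times larger than the synchronous elastic one, while the synchronous elastic design takes about 1.33 times as long to execute.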
Summary & Future Work
Summary:
- A framework for exploring GALSification: an extension to the Teak EDA flow which provides a framework for exploring GALSification techniques and behavioural partitioning
- A re-synthesis mechanism to exploit synchronous EDA: exploiting the synchronous elastic protocol to move from the asynchronous domain to the synchronous domain, where it is possible to leverage synchronous EDA tools to improve the circuits
Future Work:
- Automatic partitioning of the system into multiple clock domains: running the re-synthesised structures at different clock frequencies based on their behaviour is what we pursue as future work
References
[1] Wei Song, Jim D. Garside, Doug Edwards: "Automatic data path extraction in large-scale register-transfer level designs". ISCAS 2014: 377-380
[2] Wei Song, Jim D. Garside: "Automatic Controller Detection for Large Scale RTL Designs". DSD 2013: 844-851
[3] Jordi Cortadella, Mike Kishinevsky, Bill Grundmann: "SELF: Specification and design of a synchronous elastic architecture for DSM systems". TAU'2006: Handouts of the International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems. 2006.
Youtube: eTeak - A Synchronous Elastic Dataflow Synthesiser
Thanks for Listening!