SLIDE 1

Optimised Synthesis of Asynchronous Elastic Dataflows by Leveraging Clocked EDA

Mahdi Jelodari Mamaghani, Jim Garside, Will Toms and Doug Edwards
Verona, Italy, 29th August 2014

SLIDE 2

Motivation: Automatic GALSification of Control-driven Systems

  • Partitioning a control-driven system at the behavioural level is complicated
  • Detecting signal correspondence between the data path and the control path is error-prone
  • The presence of global control impedes the detection process [1,2]
  • Balsa, Petrify, VeriSyn and AVS: popular control-driven synthesis tools
  • GCD: an example of a data-dependent loop

[GCD Control Path] [GCD Data Path]

The Advanced Processor Technologies Research Group
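
As a minimal sketch of why GCD counts as a data-dependent loop, the subtractive form of Euclid's algorithm can be written in Python. The function name `gcd_steps` is illustrative and this is not the Balsa description used in the talk; the point is only that both the loop condition and the branch depend on the operand values, so the iteration count cannot be fixed at synthesis time.

```python
def gcd_steps(a, b):
    """Subtractive GCD of two positive integers.

    Returns (gcd, iterations). The loop condition and the branch both
    depend on the data, so the iteration count is not known statically.
    """
    if a <= 0 or b <= 0:
        raise ValueError("operands must be positive integers")
    iterations = 0
    while a != b:      # data-dependent loop condition
        if a > b:      # data-dependent steering between the two updates
            a -= b
        else:
            b -= a
        iterations += 1
    return a, iterations
```

Calling `gcd_steps(67, 2)` and `gcd_steps(12, 18)` gives different iteration counts for the same code, which is the behaviour a control-driven flow has to capture in its control path.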

SLIDE 3

Motivation: Automatic GALSification of Data-driven Systems

  • Concurrent dataflow specification of a system
  • Fine-grained dataflow synthesis
  • Automatically partitioning the system into multiple clocked islands

SLIDE 4

Teak: Asynchronous Dataflow Backend for Balsa Language

Released in 2010 as a dataflow syntax-directed synthesis backend for the Balsa language [ACSD’09]. Some of the properties of Teak Dataflow Networks (TDNs):

  • Communication: point-to-point communication between computation blocks. Slack-elastic channels are capable of storing ‘any number’ of tokens.
  • Computational: macro-module style with separate Go and Done activation signals. These modules are chained in sequence or in parallel according to the source-level directives.

Dataflow which realises data-dependent computation.

SLIDE 5

Teak: Behavioural Synthesis Flow

[Syntax-directed synthesis flow]

SLIDE 6

Teak Model of Computation: Macro-modules

  • Single-input, single-output macro-modules
  • Connected by buffers
  • Control and data move along through the macro-modules

Teak uses three hierarchical primitives to form a dataflow network: Sequential, SteerMerge and Iterative.

SLIDE 7

Protocol: Conventional Synchronous vs. Elastic

  • In conventional synchronous systems, latency = 0: timing is aligned by inserting buffers at the post-synthesis stage.
  • In elastic systems, latency can vary: the system tolerates variations in latency through handshaking.

[Synchronous] [Asynchronous]

  • In synchronous elastic systems, latency is discretised by the clock: a common timing discipline is introduced to the handshake system.

[Asynchronous] [Synchronous Elastic]

SLIDE 8

A Common Timing Discipline for Asynchronous Dataflow Networks of Teak

The Synchronous Elastic* protocol is incorporated in the Teak flow as a common timing discipline:

  • Deterministic behaviour (bounded delays)
  • Simplified deadlock issue in the network
  • Smaller circuit area (about 4 times smaller)
  • Still preserves slack elasticity (any amount of storage on links)
  • Improved power utilisation (clock gating + simple handshake)

*Synchronous Elastic Flow (SELF) [3]

[Figure labels: SDF; CSP Networks (non-deterministic); Kahn Process Networks (deterministic)]

SLIDE 9

Correctness in Asynchronous Dataflow Networks of Teak

  • Variables in dataflow networks: single write/multiple read
  • A variable provides one place for data tokens, so >2 latches are needed to ensure deadlock freedom

SLIDE 10

Correctness in SELF Adapted Networks

  • Variables in eTeak: elastic controllers with a pair of latches operating at opposite clock phases
  • Operations: write takes 1 cycle and read takes 0 cycles
  • Each variable provides two places for data tokens
  • Loops with write/read operations do not need extra latches
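
The two-place behaviour described above can be sketched as a cycle-level Python model. This is an abstraction under stated assumptions, not eTeak's implementation: the `valid`/`stop` signal names follow the SELF convention [3], the class and function names are hypothetical, and the two opposite-phase latches are collapsed into a queue of capacity 2. A write becomes readable one cycle later; a read removes the head token in the cycle it is requested.

```python
from collections import deque


class TwoPlaceVariable:
    """Cycle-level model of a storage element with two token places."""

    CAPACITY = 2

    def __init__(self):
        self.places = deque()
        self.max_occupancy = 0

    def cycle(self, in_valid, in_data, out_stop):
        """Advance one clock cycle.

        Returns (out_valid, out_data, in_stop) as seen during this cycle.
        """
        # Handshake outputs depend only on the state at the start of the cycle.
        out_valid = len(self.places) > 0
        out_data = self.places[0] if out_valid else None
        in_stop = len(self.places) == self.CAPACITY

        # Read: the head token leaves in the same cycle it is requested.
        if out_valid and not out_stop:
            self.places.popleft()
        # Write: the token is stored now and becomes readable next cycle.
        if in_valid and not in_stop:
            self.places.append(in_data)

        self.max_occupancy = max(self.max_occupancy, len(self.places))
        return out_valid, out_data, in_stop


def run(tokens, stop_pattern):
    """Drive the variable with a producer and a consumer that stalls per stop_pattern."""
    var = TwoPlaceVariable()
    pending = deque(tokens)
    received = []
    for out_stop in stop_pattern:
        in_valid = len(pending) > 0
        in_data = pending[0] if in_valid else None
        out_valid, out_data, in_stop = var.cycle(in_valid, in_data, out_stop)
        if out_valid and not out_stop:
            received.append(out_data)   # consumer accepted the head token
        if in_valid and not in_stop:
            pending.popleft()           # producer's token was stored
    return received, var.max_occupancy
```

Running `run([1, 2, 3, 4], [False, True, True, True, False, False, False, False, False])` stalls the consumer for three cycles; the variable fills both places, back-pressures the producer, and then delivers all four tokens in order once the stall is released.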

SLIDE 11

Synchronous Crystallisation and Re-synthesis

  • Synchronous Crystallisation: regional transformation of a dataflow into a synchronous control-driven circuit through re-synthesis
  • The candidates for Crystallisation are selected based on their physical characteristics (e.g. critical path)
  • Synthesis at system level enables us to rapidly explore different trade-offs between power, performance and area

SLIDE 12

Crystallisation: Through RTL Transformation

By extracting the occurrence graph and detecting concurrent dataflows within the Teak network.

What we achieve by this transformation:

  • Locally synchronous, deterministic behaviour with reduced fine-grained communication overhead
  • Easier modelling and partitioning towards GALSification
  • Use the power of clocked EDA to re-synthesise
  • Pipelined structures: better throughput

SLIDE 13

Elastic to RTL Transformation: The Algorithm

Case A

When Root is a Fork and MM1 / MM2 are independent:

    always @ (posedge CLK) : FSM_A1
        Out_1 <= φ1 (A, B)
    always @ (posedge CLK) : FSM_A2
        Out_2 <= φ2 (A, B)
    assign Out = Join (Out_1, Out_2)

[Figure labels: Root, A, B, MM1 (φ1), MM2 (φ2), Sink, Out]

SLIDE 14

Elastic to RTL Transformation: The Algorithm

Case B

When Root is a Fork and MM1 / MM2 are dependent:

    always @ (posedge CLK) : FSM_B
        State1: Out_temp <= φ1 (A, B)
        State2: Out_2 <= φ2 (A, B, Out_temp)
    assign Out = Out_2

[Figure labels: Root, A, B, MM1 (φ1), MM2 (φ2), Sink, Out]

SLIDE 15

Elastic to RTL Transformation: The Algorithm

Case C

When Root is a Splitter/Steer:

    always @ (posedge CLK) : FSM_C
        State_Root: Case (A, B)
            1: State1
            2: State2
        State1: Out_1 <= φ1 (A, B)
        State2: Out_2 <= φ2 (A, B)
    assign Out = Merge (Out_1, Out_2)

[Figure labels: Root, A, B, MM1 (φ1), MM2 (φ2), Sink, Out]
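
The three transformation cases (A: Fork with independent macro-modules, B: Fork with dependent macro-modules, C: Splitter/Steer) can be compared with a small behavioural model in Python. This is a sketch under assumptions rather than the eTeak algorithm itself: the function names are hypothetical, `Join` is modelled as pairing both results, `Merge` as forwarding whichever branch fired, and the cycle counts (one per FSM state visited) are an interpretation of the pseudocode on the slides.

```python
def case_a(phi1, phi2, a, b):
    """Fork root, independent macro-modules (FSM_A1 and FSM_A2 in parallel)."""
    out_1 = phi1(a, b)            # FSM_A1, fires on the clock edge
    out_2 = phi2(a, b)            # FSM_A2, fires on the same clock edge
    return (out_1, out_2), 1      # Join modelled as pairing both results


def case_b(phi1, phi2, a, b):
    """Fork root, dependent macro-modules (single FSM_B, two states)."""
    out_temp = phi1(a, b)         # State1
    out_2 = phi2(a, b, out_temp)  # State2 consumes the State1 result
    return out_2, 2


def case_c(select, phi1, phi2, a, b):
    """Splitter/Steer root (single FSM_C): only one branch executes."""
    branch = select(a, b)         # State_Root decides the next state
    if branch == 1:
        out = phi1(a, b)          # State1
    elif branch == 2:
        out = phi2(a, b)          # State2
    else:
        raise ValueError("select must return 1 or 2")
    return out, 2                 # Merge forwards whichever branch fired


def add(a, b):
    return a + b


def sub(a, b):
    return a - b
```

For example, `case_a(add, sub, 7, 3)` evaluates both functions in a single modelled cycle, whereas `case_b(add, lambda a, b, t: t * b, 7, 3)` needs two because the second function waits for the first result. In this model the independent case is the one that keeps its concurrency after crystallisation.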

SLIDE 16

RTL Transformation for the Shifter Example

In this example, Root is a Splitter (Case C) whilst the macro-modules are dependent (Case B); therefore the whole structure is transformed into a single FSM.

SLIDE 17

eTeak Snapshot: Visual Crystallised Partitions

SLIDE 18

Async. vs. Sync. Elastic: Area Cost

  • Case study: SSEM, a three-stage iterative processor implemented in Balsa
  • Deadlock-free design: Async. (65 buffers) vs. Sync. Elastic (6 buffers)
  • The slack elastic property is preserved

[Chart: Area Cost of the Asynchronous and Synchronous Elastic designs, broken down into F-J-M-S, Variables, Subtracter and Latch]

SLIDE 19

Asynchronous vs. Synchronous Elastic SSEM

  • Application: GCD (67, 2): 250 instructions
  • Slack matching can potentially improve the performance by a factor of 3

[Chart axes: Area, 1/Throughput]

Design                                  Total Cell Area (k*mm^2)   Exec. Time (10*ms)
Asynchronous*                           68.41                      40.61
Synchronous Elastic* (f = 1.250 GHz)    47.447                     147.47
Asynchronous                            56.183                     46.5
Synchronous Elastic (f = 435 MHz)       12.563                     62.04
Solid Synchronous (f = 1.1 GHz)         7.723                      16.438

*Fully buffered to confirm the slack elastic property
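
The reported figures can be compared with a few lines of Python. This is only a sketch: it assumes the values on the slide are listed in the same order as the design names, and the dictionary and helper names are illustrative. Under that assumption the unbuffered asynchronous design is roughly 4.5 times larger than the synchronous elastic one, which is in line with the "~4 times" area claim made earlier in the talk, while the synchronous elastic design takes about 1.33 times as long to execute.

```python
# (total cell area, execution time) per design, in the slide's own units.
# Assumption: values are paired with design names in the order they appear.
results = {
    "Asynchronous*": (68.41, 40.61),
    "Synchronous Elastic*": (47.447, 147.47),
    "Asynchronous": (56.183, 46.5),
    "Synchronous Elastic": (12.563, 62.04),
    "Solid Synchronous": (7.723, 16.438),
}


def ratio(design_a, design_b, metric):
    """Return metric(design_a) / metric(design_b) for metric in {'area', 'time'}."""
    index = {"area": 0, "time": 1}[metric]
    return results[design_a][index] / results[design_b][index]


area_ratio = ratio("Asynchronous", "Synchronous Elastic", "area")
time_ratio = ratio("Synchronous Elastic", "Asynchronous", "time")
print(f"area: async / sync elastic = {area_ratio:.2f}")
print(f"time: sync elastic / async = {time_ratio:.2f}")
```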

SLIDE 20

Summary & Future work

Summary:

  • A framework for exploring GALSification: an extension to the Teak EDA flow which provides a framework for exploring GALSification techniques and behavioural partitioning
  • A re-synthesis mechanism to exploit synchronous EDA: exploiting the synchronous elastic protocol to move from the asynchronous domain to the synchronous domain, where it is possible to leverage synchronous EDA tools to improve the circuits

Future Work:

  • Automatically partitioning the system into multiple clock domains: running the re-synthesised structures at different clock frequencies based on their behaviour is what we pursue as future work

SLIDE 21

References

[1] Wei Song, Jim D. Garside, Doug Edwards: "Automatic data path extraction in large-scale register-transfer level designs". ISCAS 2014: 377-380
[2] Wei Song, Jim D. Garside: "Automatic Controller Detection for Large Scale RTL Designs". DSD 2013: 844-851
[3] Jordi Cortadella, Mike Kishinevsky, Bill Grundmann: "SELF: Specification and design of a synchronous elastic architecture for DSM systems". TAU 2006: Handouts of the International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems. 2006.

SLIDE 22

Thanks for Listening!

YouTube: eTeak - A Synchronous Elastic Dataflow Synthesiser

We acknowledge EPSRC for supporting this research under the GAELS project “Globally Asynchronous Elastic Logic Synthesis” (EP/I038306/1)