FPGA Architecture Support for Heterogeneous, Relocatable Partial Bitstreams

24th International Conference on Field Programmable Logic and Applications, September 3rd, 2014. FPGA Architecture Support for Heterogeneous, Relocatable Partial Bitstreams. Christophe Huriaux, Olivier Sentieys, Russell Tessier.


SLIDE 1

24th International Conference on Field Programmable Logic and Applications

September 3rd, 2014

  • C. Huriaux, O. Sentieys and R. Tessier

FPGA Architecture Support for Heterogeneous, Relocatable Partial Bitstreams

Christophe HURIAUX (1), Olivier SENTIEYS (1, 2), Russell TESSIER (3)

(1) University of Rennes 1, France; (2) Inria, France; (3) University of Massachusetts, USA

SLIDE 2

Outline

§ Introduction
  § Overview of the FlexTiles project
  § Architecture Overview
  § Advantages of 3-D Stacking
§ Principles
  § Task Migration in an FPGA
  § Task Migration in FlexTiles
  § Heterogeneous case
§ Approach
  § Coping with Heterogeneity
  § Design Constraints
§ Results
  § Implementation in VPR
§ Conclusion

SLIDE 3

FP7 FlexTiles Project

§ FlexTiles: self-adaptive heterogeneous manycore based on Flexible Tiles

§ Provide a heterogeneous many-core architecture offering
  § Large flexibility
  § High performance, energy efficiency
  § Raised programming efficiency
  § Self-adaptation through virtualization

SLIDE 4

Architecture Overview

§ 3D-Stacked Heterogeneous manycore

§ General Purpose Processors (GPP)

§ for flexibility and programming homogeneity

§ Network On Chip

§ Dedicated hardware accelerators mapped at run-time on a reconfigurable layer

§ Reconfigurable layer with seamless task migration capabilities
§ Virtualization layer to provide an abstraction of the manycore and self-adaptive services
§ Tool-chain for parallelization and compilation

SLIDE 5

Architecture Overview

[Figure labels: 3D interface to the NoC; DSP blocks; Memory blocks]

SLIDE 6

Task migration

§ Classical problem in dynamic reconfiguration [1]
§ Enhance resource usage

[Figure: 4x4 task, placement marked "?"]

[1] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck, “Configuration relocation and defragmentation for run-time reconfigurable computing,” IEEE Transactions on VLSI Systems, vol. 10, no. 3, pp. 209–220, 2002.

SLIDE 7

3D Stacking

[Figure: 3D stack of a reconfigurable layer and a multicore layer (grid of cores)]

§ 3D-Stacked Reconfigurable Accelerators
  § Improved resource usage
  § Improved bandwidth/latency
  § Improved performance and energy efficiency

SLIDE 8

Task Migration in an FPGA

§ Predefined reconfigurable regions

§ Bit-stream depends on task location

[Figure: fabric surrounded by I/O blocks; HW accelerator #1 needs bit-stream BS #1 at one location and BS #2 at another]

SLIDE 9

Task Migration in FlexTiles

§ A task is synthesized, placed and routed into a Virtual Bit-Stream (VBS)
  § Independent of the task's physical location in the fabric
  § No predefined configuration domains
§ Easier resource sharing and distribution, simplified task migration
§ Reconfiguration controller generates the final bit-stream (BS) at run-time
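
As a rough illustration of why a location-independent VBS simplifies migration, the sketch below stores configuration words at offsets relative to the task origin and lets a stand-in for the reconfiguration controller translate them to absolute fabric coordinates at load time. The `VirtualBitstream` class, the `finalize` function and the per-cell configuration words are hypothetical names and data; the slides do not describe the actual VBS format or controller.

```python
from dataclasses import dataclass


@dataclass
class VirtualBitstream:
    """Hypothetical location-independent task description (not the FlexTiles format)."""
    width: int
    height: int
    # Configuration words keyed by (dx, dy) offsets relative to the task origin.
    cells: dict


def finalize(vbs, origin, fabric_size):
    """Translate relative offsets into absolute fabric coordinates for one placement."""
    ox, oy = origin
    fabric_width, fabric_height = fabric_size
    fits = (ox >= 0 and oy >= 0
            and ox + vbs.width <= fabric_width
            and oy + vbs.height <= fabric_height)
    if not fits:
        raise ValueError("task does not fit at the requested origin")
    return {(ox + dx, oy + dy): word for (dx, dy), word in vbs.cells.items()}


# The same VBS is finalized for two different locations without re-running place and route.
vbs = VirtualBitstream(width=2, height=2, cells={(0, 0): 0b1010, (1, 1): 0b0110})
first = finalize(vbs, origin=(0, 0), fabric_size=(8, 8))
moved = finalize(vbs, origin=(4, 3), fabric_size=(8, 8))
```

Only the origin changes between the two calls, which is the property the slide attributes to the VBS: migration becomes a new finalization step rather than a new bit-stream per region.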

SLIDE 10

Task Migration in FlexTiles

[Figure: fabric containing 3D NI, RAM and DSP blocks; HW accelerator #1 configured from VBS #1 and HW accelerator #2 from VBS #2]

SLIDE 11

Heterogeneity

§ Homogeneous case
  § No constraint on task placement
  § Regular routing architecture
§ Cope with heterogeneity
  § RAM, DSP, 3D I/Os
  § Migration is limited
    § vertically to the same column
    § to the next column containing the same complex blocks

[Figure legend: Task; Configured LE; Logic Element (LE)]
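
The migration limit in the heterogeneous case can be sketched as a pattern match on column types: without extra architectural support, a task may only move horizontally to an origin where the sequence of column types under it is identical. The column layout and the function name below are illustrative assumptions, not taken from the FlexTiles fabric.

```python
def valid_column_origins(fabric_columns, task_origin, task_width):
    """Return every horizontal origin whose column types match those under the task."""
    signature = fabric_columns[task_origin:task_origin + task_width]
    last_origin = len(fabric_columns) - task_width
    return [
        x for x in range(last_origin + 1)
        if fabric_columns[x:x + task_width] == signature
    ]


# Hypothetical fabric: logic columns (LE) interleaved with RAM and DSP columns.
fabric = ["LE", "LE", "RAM", "LE", "LE", "DSP",
          "LE", "LE", "RAM", "LE", "LE", "DSP"]

# A task three columns wide that currently covers LE, RAM, LE.
origins = valid_column_origins(fabric, task_origin=1, task_width=3)
```

With this layout the task can only stay at column 1 or jump to column 7, the next position with the same LE, RAM, LE pattern; vertical moves within the same columns leave the pattern unchanged.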

SLIDE 12

Proposed architecture

§ Heterogeneous block routing is abstracted from logic routing
§ Long lines allow a trade-off between placement flexibility and routing complexity
§ Two-level routing is performed at run-time:
  § Logic routing (as in the homogeneous case)
  § Heterogeneous block routing through long lines

SLIDE 13

Design Constraints

§ I/Os are made through 3D Network Interfaces, spread over the reconfigurable fabric

[Figure: floorplan with reconfiguration RAM and reconfiguration controller (CTRL), DSP and MEM blocks, and 3D NI / AI blocks spread over the fabric]

SLIDE 14

Implementation in VPR

§ Versatile Place and Route (VPR): open-source CAD tool for placement and routing
§ Part of the Verilog-to-Routing (VTR) framework
§ Source code modified to implement our techniques and deal with our constraints
  § Horizontal long lines spread over partitions
  § Separate homogeneous and heterogeneous routing

VPR and VTR: https://code.google.com/p/vtr-verilog-to-routing/

SLIDE 15

Implementation in VPR

VPR Original Routing Model

§ Logic grid
§ Block placement
  § X: simple block
  § Y: 2 blocks tall
§ Mesh routing lines
§ Switch boxes
§ Interconnect

[Figure: grid with X and Y blocks; connection examples with Fc=0.5 and Fc=1]

SLIDE 16

Implementation in VPR

Enhanced Routing Model

§ Logic grid
§ Block placement
§ Block typing
  § X: homogeneous
  § Y: heterogeneous
§ Mesh routing lines
§ Long lines
§ Switch boxes
§ Interconnect
  § Homogeneous
  § Heterogeneous

[Figure: grid with X (homogeneous) and Y (heterogeneous) blocks]

SLIDE 17

Results

§ Architecture based on a simplified Stratix IV with:
  § Dual-port 144k memories
  § Fracturable 36x36 multipliers
§ Evaluation on two criteria
  § Delay of the critical path
  § Minimum channel width
    § Number of tracks in the homogeneous routing channels
    § Minimum channel width determined by VPR
    § Not directly related to silicon area
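
A minimum channel width is typically found by attempting routing at several widths and keeping the smallest one that succeeds. The sketch below shows that idea as a plain binary search, assuming routability is monotonic in the channel width. The `routes` callable is a hypothetical stand-in for a full routing attempt and is not a VPR API; VPR's own search differs in its details.

```python
def min_channel_width(routes, low=1, high=512):
    """Smallest width in [low, high] for which routes(width) is True.

    Assumes that if a design routes at width W, it also routes at any width above W.
    """
    if not routes(high):
        raise ValueError("design does not route even at the maximum width")
    while low < high:
        mid = (low + high) // 2
        if routes(mid):
            high = mid       # mid is routable, so the minimum is mid or smaller
        else:
            low = mid + 1    # mid fails, so the minimum is strictly larger
    return low


# Hypothetical design that becomes routable at 46 tracks per channel.
found = min_channel_width(lambda width: width >= 46)
```

The two evaluated architectures can then be compared by the ratio of their minimum widths for the same circuit.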

SLIDE 18

Results

§ Benchmark set: VTR framework circuits [1]

[1] J. Rose, J. Luu, C. W. Yu, et al., “The VTR project: architecture and CAD for FPGAs from Verilog to routing,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ACM, 2012, pp. 77–86.

Circuit             # Mem   # Mult   # LB
bgm                 -       11       2,174
boundtop            1       -        2,977
ch_intrinsics       1       -        272
diffeq1             -       5        41
diffeq2             -       5        43
LU8PEEng            45      8        30
mkDelayWorker32B    41      -        497
mkPktMerge          15      -        17
mkSMAdapter4B       5       -        181
or1200              2       1        273
raygentop           1       7        192
stereovision1       -       38       990

SLIDE 19

Results: Delay

§ Estimation of the worst-case delay
  § Impossible to predict where connections to long lines will be made
  § Some channels crossing fixed-function blocks are longer

SLIDE 20

Results: Delay

§ Only 2% delay increase on average

[Figure: critical path delay in ns for the classic and enhanced architectures, with the proposed/classic ratio]

SLIDE 21

Results: Min. Channel Width

§ 1.8x channel width increase on average
§ Need for specific routing algorithms to deal with the heterogeneous interconnection network

[Figure: minimum channel width (number of tracks) for the classic and enhanced architectures, with the proposed/classic ratio]

SLIDE 22

Conclusion

§ FPGA embedded in a 3D architecture
§ More flexibility for task placement and/or relocation
§ Low impact on delay, but a cost in routing resources
§ Need to find a trade-off between flexibility and the area increase caused by additional connections

SLIDE 23

Thank you for your attention

More info on FlexTiles: http://www.flextiles.eu

SLIDE 25

Virtual Bit-Stream: Example

§ Hiding routing details
  § Full BS is 129 bits
  § Could be reduced by giving fewer details

  • Jan. 2014

CAIRN project-team

[Figure: logic block with inputs CLBIN[0] to CLBIN[3] and output CLBOUT, surrounded by points numbered 0 to 20]

SLIDE 26

Virtual Bit-Stream: Example

§ Hiding routing details
  § List of I/O and connections
    § 20 → 8
    § 1 → 9
    § 5 → 18
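
To give a feel for how a connection list can be smaller than a full bit-stream, the sketch below packs each connection as a pair of fixed-width endpoint indices. The encoding is purely an assumption for illustration (the slides do not specify the VBS encoding), and `encode_connections` is a hypothetical helper; only the three connections and the 0 to 20 numbering come from the slide.

```python
def encode_connections(connections, num_endpoints):
    """Pack (source, destination) pairs into a bit string of fixed-width indices."""
    # Width needed to index any endpoint: 21 endpoints (0..20) need 5 bits.
    bits_per_endpoint = max(1, (num_endpoints - 1).bit_length())
    encoded = ""
    for source, destination in connections:
        encoded += format(source, f"0{bits_per_endpoint}b")
        encoded += format(destination, f"0{bits_per_endpoint}b")
    return encoded


# Connections listed on the slide, over endpoints numbered 0 to 20.
connections = [(20, 8), (1, 9), (5, 18)]
encoded = encode_connections(connections, num_endpoints=21)
size_in_bits = len(encoded)
```

Under this assumed encoding the three connections take 30 bits, compared with the 129 bits quoted for the full BS of the same example; the actual VBS format may carry additional information.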

[Figure: points numbered 0 to 20]

SLIDE 27

Results: BS Sizes on MCNC Benchmarks

[Figure: BS size in kilobits (axis 0 to 1600), split into routing and logic, for benchmarks labeled tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3]

SLIDE 28

Results: VBS Sizes on MCNC Benchmarks

Compression ratios shown on the chart: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%

[Figure: BS size and VBS size in kilobits (axis 0 to 1600) with compression ratio (axis 0% to 100%) for benchmarks labeled tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3]

SLIDE 29

Introduction: Architecture Overview

[Figure label: 3D Access Point to the NoC]

SLIDE 30

Introduction: Architecture Overview

[Figure: General Architecture Overview]