Automatic Generation of High Throughput Energy Efficient Streaming - - PowerPoint PPT Presentation

SLIDE 1

Automatic Generation of High Throughput Energy Efficient Streaming Architectures for Arbitrary Fixed Permutations

Ren Chen, Viktor K. Prasanna

Ming Hsieh Department of Electrical Engineering

Presented by:

Ajitesh Srivastava, Department of Computer Science University of Southern California

Ganges.usc.edu/wiki/TAPAS

SLIDE 2

Outline

  • Introduction
  • Background and Related Work
  • High Throughput and Energy Efficient Design
  • Experimental Results
  • Conclusion and Future Work

SLIDE 3

Permutation

  • A permutation can be represented using $y = P_n \cdot x$
  • n is the size of vectors x and y
  • The n × n bit matrix P_n is called a permutation matrix

[Figure: inputs x0 x1 x2 x3 permuted into outputs y0 y1 y2 y3]

SLIDE 4

Related Applications

  • Key algorithms: FFT, sorting, Viterbi decoding, etc.
  • Example uses: frequency domain in images, image filtering, audio analysis, bitonic sort, partial differential equations, OFDM system

[Figure: dataflow from Input to Output through permutation blocks labeled P4,2, P8,4, P8,2, and Q8]

SLIDE 5

Data Permutation in Conventional Architectures

  • Permutation by wires
  • Parallel architecture
  • Permutation by memory
  • r registers
  • Pipeline architecture
  • Shared memory architecture

[Figure: (a) Parallel architecture: an array of processing elements between input and output. (b) Shared memory architecture: a processing element connected to shared memory banks 1 to r. (c) Pipeline architecture: a chain of processing elements.]

SLIDE 6

Data Permutation in Streaming Architectures

  • Streaming architecture
  • High data parallelism
  • High design throughput
  • Simple control scheme
  • No requirement on data input/output order
SLIDE 7

Problem Definition

Permute streaming data with a fixed data parallelism

  • Input/output: in a streaming manner and at a fixed rate
  • Data parallelism p: number of inputs processed each cycle per computation stage
  • Streaming permutation: permutation between adjacent computation stages
  • Processing elements: computation units for a given application

[Figure: an FPGA connected to external memory through a memory interface. The stream input passes through stages of processing elements, with a streaming permutation of width p between adjacent stages, and leaves as the stream output.]
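
The problem statement can be modeled in software as follows. This is a behavioral sketch only: it buffers a whole frame in a Python list and does not reflect the memory-efficient hardware design described later in the deck. The function name `stream_permute`, the `dest` convention, and the frame-at-a-time buffering are assumptions made for this example.

```python
def stream_permute(words, dest, p):
    """Permute one frame of len(dest) elements that arrives p elements per cycle.

    words: list of input words, one per cycle, each holding p elements.
    dest:  dest[i] is the output index of the i-th input element (assumed convention).
    """
    n = len(dest)
    if n % p != 0 or len(words) != n // p:
        raise ValueError("frame must contain exactly n/p words of p elements")
    memory = [None] * n                      # stands in for the on-chip buffers
    for cycle, word in enumerate(words):     # write phase: one word per cycle
        for lane, value in enumerate(word):
            memory[dest[cycle * p + lane]] = value
    # read phase: emit the permuted frame, again p elements per cycle
    return [memory[c * p:(c + 1) * p] for c in range(n // p)]


frame = [[0, 1], [2, 3], [4, 5], [6, 7]]     # problem size 8, data parallelism 2
bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]      # example permutation
out = stream_permute(frame, bit_reversal, p=2)
print(out)                                   # [[0, 4], [2, 6], [1, 5], [3, 7]]
```

Both the input and the output move at the same fixed rate of p elements per cycle, which is the property the slide asks of a streaming permutation.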

SLIDE 8

Outline

  • Introduction
  • Background and Related Work
  • High Throughput and Energy Efficient Design
  • Experimental Results
  • Conclusion and Future Work

SLIDE 9

Related Work (JVSP ’07, T. Jarvinen)

  • For stride permutation on an array processor
  • Flexible data parallelism
  • Mathematical formulation
SLIDE 10

Related Work (DAC ’12, M. Zuluaga and M. Püschel)

  • Domain-specific language based
  • Hardware generator for data permutations in sorting
SLIDE 11

Proposed Design Approach

  • Drawbacks of the state of the art
  • Only supports specific permutation patterns
  • Design scalability needs to be improved
  • Not memory efficient
  • No efficient control logic
  • We propose a mapping approach to obtain a streaming permutation architecture
  • Utilizes the Benes network for building the datapath and generating control logic
  • Highly optimized with respect to memory efficiency and interconnection complexity
  • Scalable with problem size N and data parallelism p
  • Supports processing continuous data streams
  • Design automation tool
SLIDE 12

Benes Network

  • Multistage network to realize all n! permutations
  • Rearrangeably non-blocking
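
For a sense of scale, the sketch below computes the size of a standard (unfolded) Benes network built from 2 × 2 switches, using the textbook result that a network with n = 2^k inputs has 2k - 1 stages of n/2 switches each. The function name `benes_size` is hypothetical, and the numbers describe the classic network, not the folded datapath proposed in this deck.

```python
def benes_size(n):
    """Return (stages, switches_per_stage, total_switches) for n = 2**k inputs."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    k = n.bit_length() - 1            # log2(n) for a power of two
    stages = 2 * k - 1
    per_stage = n // 2
    return stages, per_stage, stages * per_stage


for n in (2, 4, 8, 16):
    print(n, benes_size(n))
# 2 (1, 1, 1)
# 4 (3, 2, 6)
# 8 (5, 4, 20)
# 16 (7, 8, 56)
```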
SLIDE 13

Outline

  • Introduction
  • Background and Related Work
  • High Throughput and Energy Efficient Design
  • Experimental Results
  • Conclusion and Future Work

SLIDE 14

Architecture Overview

  • Parameterized architecture
  • Problem size N
  • Data parallelism p
  • Memory based
  • p independent memory blocks
  • Each of size N/p
  • p-to-p connection network
  • $(p/2) \log p$ 2 × 2 switches
  • Optimal compared with the state of the art
  • Highly optimized control unit
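
The resource figures listed on this slide can be tabulated for a given problem size N and data parallelism p. The sketch below assumes the logarithm is base 2 and that p is a power of two dividing N; neither assumption is stated on the slide, and the function name `architecture_parameters` is hypothetical.

```python
def architecture_parameters(N, p):
    """Memory and switch counts from the slide's formulas (log assumed base 2)."""
    if p < 2 or p & (p - 1):
        raise ValueError("p is assumed to be a power of two and at least 2")
    if N % p:
        raise ValueError("N is assumed to be a multiple of p")
    log_p = p.bit_length() - 1
    return {
        "memory_blocks": p,                  # p independent memory blocks
        "words_per_block": N // p,           # each of size N/p
        "switches_2x2": (p // 2) * log_p,    # (p/2) log p switches
    }


print(architecture_parameters(1024, 8))
# {'memory_blocks': 8, 'words_per_block': 128, 'switches_2x2': 12}
```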

SLIDE 15

Proposed Mapping Approach

  • Vertically fold the Benes network
  • Build a three-stage datapath
  • Divide-and-conquer based method
  • For a fixed data parallelism p
  • Support continuous data streams

SLIDE 16

Automatic Generation of the Datapath (1)

  • GDP(N, p): Generating Datapath
  • N: problem size, p: data parallelism
  • A: upper part of datapath, B: lower part of datapath
SLIDE 17

Automatic Generation of the Datapath (2)

SLIDE 18

Automatic Generation of Control Logic (1)

  • Configuration bits of a switch in different states
SLIDE 19

Automatic Generation of Control Logic (2)

  • Single Stage Routing
  • X: input data vector
  • Y: permuted data vector
  • π: mapping from X to Y
  • X′: output data vector of input switches
  • Y′: input data vector of output switches
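
The transcript does not include the routing procedure itself, so the following is a sketch of the classic looping approach for one outer stage of a Benes network, offered as an assumption about what single stage routing involves rather than as the authors' algorithm. It decides, for each input element, whether it travels through the upper (0) or lower (1) subnetwork so that the two elements of every input switch and of every output switch go to different subnetworks. All function names are hypothetical.

```python
import itertools


def route_single_stage(pi):
    """Return sub[i] in {0, 1}: the subnetwork chosen for input element i.

    pi[i] is the output position of input i; len(pi) must be even.
    Inputs 2k and 2k+1 share an input switch, outputs 2k and 2k+1 an output switch.
    """
    n = len(pi)
    inverse = [0] * n
    for src, dst in enumerate(pi):
        inverse[dst] = src
    sub = [None] * n
    for start in range(n):
        i = start
        while sub[i] is None:          # walk one loop of alternating constraints
            sub[i] = 0
            j = inverse[pi[i] ^ 1]     # input that shares i's output switch
            sub[j] = 1
            i = j ^ 1                  # j's neighbour on its input switch
    return sub


def input_switch_bits(sub):
    """One configuration bit per input switch: 0 = straight, 1 = cross."""
    return [sub[2 * k] for k in range(len(sub) // 2)]


def is_valid_split(pi, sub):
    """Check that no input or output switch sends both elements the same way."""
    inverse = {dst: src for src, dst in enumerate(pi)}
    half = len(pi) // 2
    inputs_ok = all(sub[2 * k] != sub[2 * k + 1] for k in range(half))
    outputs_ok = all(sub[inverse[2 * k]] != sub[inverse[2 * k + 1]] for k in range(half))
    return inputs_ok and outputs_ok


pi = [3, 0, 1, 2]
sub = route_single_stage(pi)
print(sub, input_switch_bits(sub))     # [0, 1, 0, 1] [0, 0]
print(all(is_valid_split(list(q), route_single_stage(list(q)))
          for q in itertools.permutations(range(6))))   # True
```

Applying the same step recursively to the two half-size subnetworks would yield settings for the inner stages.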
SLIDE 20

Automatic Generation of Control Logic (3)

  • Multiple Stage Routing
  • X: input data vector
  • Y: permuted data vector
  • p: data parallelism
SLIDE 21

An Example

SLIDE 22

Resource Consumption Summary

SLIDE 23

Outline

  • Introduction
  • Background and Related Work
  • High Throughput and Energy Efficient Design
  • Experimental Results
  • Conclusion and Future Work

SLIDE 24

Performance Metrics

  • Throughput
  • Defined as the number of bits permuted per second (Gbits/s)
  • Product of the number of data elements permuted per second and the data width per element
  • Energy efficiency
  • Defined as the number of bits permuted per unit of energy consumed (Gbits/Joule)
  • Calculated as the throughput divided by the average power consumption
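
The two metrics defined on this slide reduce to simple arithmetic. In the sketch below, the clock frequency, data parallelism, data width, and power figure are hypothetical values chosen for illustration, not measurements from the experiments, and the function names are likewise made up for this example. The element rate is taken as data parallelism times clock frequency, which assumes the design accepts a full word every cycle.

```python
def throughput_gbits_per_s(elements_per_second, width_bits):
    """Throughput = elements permuted per second x data width per element."""
    return elements_per_second * width_bits / 1e9


def energy_efficiency_gbits_per_joule(throughput_gbits, average_power_watts):
    """Energy efficiency = throughput / average power consumption."""
    return throughput_gbits / average_power_watts


# Hypothetical operating point (assumed, not from the slides)
clock_hz = 250e6
data_parallelism = 8
width_bits = 32
average_power_watts = 0.5

tp = throughput_gbits_per_s(data_parallelism * clock_hz, width_bits)
ee = energy_efficiency_gbits_per_joule(tp, average_power_watts)
print(tp, ee)    # 64.0 128.0
```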

SLIDE 25

Experimental Setup

  • Platform and tools
  • Xilinx Virtex-7 XC7VX980T, speed grade -2L
  • Xilinx Vivado 2014.2 and Vivado Power Analyzer
  • Input vectors for simulation
  • Randomly generated with an average toggle rate of 25% (pessimistic estimation)
  • Performance metrics
  • Resource consumption
  • Throughput
  • Energy efficiency

SLIDE 26

Experimental Results (1)

  • BRAM consumption of the proposed design
  • Theoretical memory requirement: reduced by 50% and 75%
  • Amount of BRAM18: reduced by up to 50% compared with the state of the art for various p

SLIDE 27

Experimental Results (2)

  • BRAM consumption of the proposed design
  • Theoretical memory requirement: reduced by 50% and 75%
  • Amount of BRAM18: reduced by up to 50% compared with the state of the art for various N

SLIDE 28

Experimental Results (3)

  • LUT consumption of the proposed design (for various p)
  • 22.1% to 67.2% fewer LUTs compared with [1]
  • 59.1% to 96.4% fewer LUTs compared with [6]

SLIDE 29

Experimental Results (4)

  • LUT consumption of the proposed design (for various N)
  • 6.6% to 65.4% fewer LUTs compared with [1]
  • 59.1% to 96.4% fewer LUTs compared with [6]

SLIDE 30

Experimental Results (5)

  • Throughput performance of the proposed design
  • Our designs achieve
  • Up to 78% throughput improvement compared with [1]
  • Up to 3.4x throughput compared with [6]

SLIDE 31

Experimental Results (6)

  • Energy efficiency comparison
  • 2.1x to 3.3x energy efficiency improvement compared with the state of the art in [6]

SLIDE 32

Conclusion and Future Work

  • Conclusion
  • Streaming data permutation architecture
  • Scalable with data parallelism and problem size
  • Efficient data permutation realization
  • “Programmable” data permutation engine
  • High throughput and resource efficient
  • Future work
  • Design framework for automatic application-specific energy efficiency and performance optimizations on FPGA

SLIDE 33

Thanks! Questions?

renchen@usc.edu (Ren Chen)

Ganges.usc.edu/wiki/TAPAS