Automatic Generation of High Throughput Energy Efficient Streaming - - PowerPoint PPT Presentation
Automatic Generation of High Throughput Energy Efficient Streaming Architectures for Arbitrary Fixed Permutations
Ren Chen, Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
Presented by: Ajitesh Srivastava, Department of
- Introduction
- Background and Related Work
- High Throughput and Energy Efficient Design
- Experimental Results
- Conclusion and Future Work
Outline
2
- Permutation
- A permutation can be represented using $z = P_N \cdot y$
- $N$ is the size of vectors $y$ and $z$
- The $N \times N$ bit matrix $P_N$ is called a permutation matrix
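As a small illustration of the matrix view of a permutation, the sketch below builds a 0/1 permutation matrix and applies it to a vector. The function names and the index convention (output i takes input perm[i]) are assumptions made for this example, not taken from the slides.

```python
def permutation_matrix(perm):
    """Build a 0/1 matrix P with P[i][perm[i]] = 1, so that (P * y)[i] = y[perm[i]].

    The convention "output i takes input perm[i]" is an assumption of this sketch.
    """
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("perm must be a rearrangement of 0..n-1")
    return [[1 if j == perm[i] else 0 for j in range(n)] for i in range(n)]


def apply_matrix(matrix, y):
    """Plain matrix-vector product z = P * y."""
    return [sum(row[j] * y[j] for j in range(len(y))) for row in matrix]


# Hypothetical 4-element example: swap the two middle elements.
perm = [0, 2, 1, 3]
P = permutation_matrix(perm)
y = [10, 20, 30, 40]
z = apply_matrix(P, y)
```

Every row and every column of P contains exactly one 1, which is what makes the product a pure reordering of the input vector.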
3
Permutation
[Figure: permutation mapping inputs y0, y1, y2, y3 to outputs z0, z1, z2, z3]
- Key Algorithms: FFT, sorting, Viterbi decoding, etc.
4
Related Applications
- Frequency domain in images
- Image filtering
- Audio analysis
- Bitonic sort
- Partial differential equations
- OFDM System
[Figure: Input-to-Output datapath with blocks labeled P4,2, P8,4, P8,2 and Q8]
5
Data Permutation in Conventional Architectures
- Permutation by wires
- Parallel architecture
- Permutation by memory
- r registers
- Pipeline architecture
- Shared memory architecture
[Figure: (a) Parallel Architecture, an array of processing elements between Input and Output; (b) Shared memory Architecture, a processing element connected to shared memory banks 1 to r; (c) Pipeline Architecture, a chain of processing elements]
6
Data Permutation in Streaming Architectures
- Streaming architecture
- High data parallelism
- High design throughput
- Simple control scheme
- No requirement on data input/output order
Permute streaming data with a fixed data parallelism
- Input/output: in a streaming manner and at a fixed rate
- Data parallelism $p$: number of inputs processed each cycle per computation stage
- Streaming permutation: permutation between adjacent computation stages
- Processing elements: computation units for a given application
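The streaming setting described above can be modeled in software as follows. This is only a behavioral sketch: it buffers a whole frame before emitting anything, which is an assumption made for simplicity and says nothing about how the hardware organizes its memory. The function names and the bit-reversal example are hypothetical.

```python
def stream_chunks(vector, p):
    """Yield the vector p elements at a time, one chunk per clock cycle."""
    if len(vector) % p != 0:
        raise ValueError("vector length must be a multiple of p")
    for start in range(0, len(vector), p):
        yield vector[start:start + p]


def streaming_permute(chunks, perm, p):
    """Behavioral model: collect one frame, then emit p permuted elements per cycle.

    Output position i carries input element perm[i] (an assumed convention).
    """
    frame = [x for chunk in chunks for x in chunk]
    if len(frame) != len(perm):
        raise ValueError("frame length must match the permutation size")
    for start in range(0, len(frame), p):
        yield [frame[perm[i]] for i in range(start, start + p)]


# Hypothetical example: 8 elements, 2 per cycle, bit-reversal permutation.
perm = [0, 4, 2, 6, 1, 5, 3, 7]
out_cycles = list(streaming_permute(stream_chunks(list(range(8)), 2), perm, 2))
```

Both the input and the output side move exactly p elements per cycle, which is the fixed-rate property the problem definition asks for.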
7
Problem Definition
[Figure: FPGA with data parallelism p, connected to external memory through a memory interface; stream input, processing elements, streaming permutation, processing elements, stream output]
9
Related Work (JVSP '07, T. Jarvinen)
- For stride permutation on array processors
- Flexible data parallelism
- Mathematical formulation
10
Related Work (DAC '12, M. Zuluaga and M. Püschel)
- Domain-specific language based
- Hardware generator for data permutations in sorting
11
Proposed Design Approach
- Drawbacks of the state-of-the-art
- Only supports specific permutation patterns
- Design scalability needs to be improved
- Not memory efficient
- No efficient control logic
- We propose a mapping approach to obtain a streaming permutation architecture
- Utilizes Benes network for building datapath and generating control logic
- Highly optimized with respect to memory efficiency and interconnection complexity
- Scalable with problem size $N$ and data parallelism $p$
- Supports processing continuous data streams
- Design automation tool
12
Benes Network
- Multistage network to realize all $N!$ permutations
- Rearrangeably non-blocking
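For a sense of scale, the standard Benes network on n inputs (n a power of two) built from 2 × 2 switches has 2·log2(n) − 1 stages of n/2 switches each. The sketch below computes these counts and checks that the number of switch settings is at least the number of permutations; the function name is hypothetical, and the counts describe the unfolded network, not the folded datapath of the proposed design.

```python
from math import factorial


def benes_size(n):
    """Return (stages, switches) for a standard n-input Benes network of 2x2 switches."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    log_n = n.bit_length() - 1          # exact log2 for a power of two
    stages = 2 * log_n - 1
    switches = stages * (n // 2)        # every stage has n/2 switches
    return stages, switches


stages, switches = benes_size(8)
# Each switch has two states, so there are 2**switches settings in total.
enough_settings = 2 ** switches >= factorial(8)
```

Having at least as many settings as permutations is only a necessary condition; that every permutation is actually reachable is the rearrangeably non-blocking property.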
- Parameterized architecture
- Problem size $N$
- Data parallelism $p$
- Memory based
- $p$ independent memory blocks
- Each of size $N/p$
- $p$-to-$p$ connection network
- $\frac{p}{2} \log p$ 2 × 2 switches
- Optimal compared with the state of the art
- Highly optimized control unit
14
Architecture Overview
- Vertically fold the Benes Network
- Build a three-stage datapath
- Divide-and-conquer based method
- For a fixed data parallelism $p$
- Support continuous data streams
15
Proposed Mapping Approach
16
Automatic Generation of the Datapath (1)
- GDP($N$, $p$): Generating Datapath
- $N$: problem size, $p$: data parallelism
- $B$: upper part of datapath, $C$: lower part of datapath
17
Automatic Generation of the Datapath (2)
18
Automatic Generation of Control Logic (1)
- Configuration bits of switch in different states
19
Automatic Generation of Control Logic (2)
- Single Stage Routing
- Input data vector
- Permuted data vector
- Mapping from the input data vector to the permuted data vector
- Output data vector of the input switches
- Input data vector of the output switches
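One common way to route a single outer stage of a Benes network is the classic looping algorithm, sketched below. It is offered as an illustration of the idea, not as the control-logic generator used in this work. All names are hypothetical, and the convention `dest[i]` = output port of input `i` is an assumption of the sketch.

```python
def route_outer_stages(dest):
    """Looping algorithm for the outer stages of a Benes network (illustrative sketch).

    dest[i] is the output port that input i must reach. Returns the crossed/straight
    state of every input and output switch, plus the two half-size permutations that
    the upper and lower subnetworks must realize.
    """
    n = len(dest)
    if n < 2 or n % 2 or sorted(dest) != list(range(n)):
        raise ValueError("dest must be a permutation of even length")
    src = [0] * n
    for i, d in enumerate(dest):
        src[d] = i
    sub = [None] * n                     # subnetwork of each input: 0 upper, 1 lower
    for start in range(0, n, 2):
        i = start
        while sub[i] is None:
            sub[i] = 0                   # send input i through the upper subnetwork
            j = src[dest[i] ^ 1]         # input sharing i's output switch
            sub[j] = 1                   # it must use the lower subnetwork
            i = j ^ 1                    # j's input-switch partner must go upper
    half = n // 2
    in_cross = [sub[2 * k] == 1 for k in range(half)]
    out_cross = [sub[src[2 * m]] == 1 for m in range(half)]
    upper, lower = [0] * half, [0] * half
    for i, d in enumerate(dest):
        (upper if sub[i] == 0 else lower)[i // 2] = d // 2
    return in_cross, out_cross, upper, lower


def simulate(n, in_cross, out_cross, upper, lower):
    """Trace every input through the three stages and return the port it reaches."""
    reached = []
    for i in range(n):
        k, port = divmod(i, 2)
        s = port ^ int(in_cross[k])              # subnetwork chosen by the input switch
        m = (upper if s == 0 else lower)[k]      # output switch chosen by the subnetwork
        reached.append(2 * m + (s ^ int(out_cross[m])))
    return reached


dest = [2, 0, 3, 1]
in_cross, out_cross, upper, lower = route_outer_stages(dest)
```

Applying the same routine recursively to `upper` and `lower` would configure the inner stages, which is the usual way the single-stage step extends to the whole network.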
20
Automatic Generation of Control Logic (3)
- Multiple Stage Routing
- Input data vector
- Permuted data vector
- $p$: data parallelism
21
An Example
25
Resource Consumption Summary
- Throughput
- Defined as the number of bits permuted per second (Gbits/s)
- Product of the number of data elements permuted per second and the data width per element
- Energy efficiency
- Defined as the number of bits permuted per unit energy consumption (Gbits/Joule)
- Calculated as the throughput divided by the average power consumption
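The two metric definitions above translate directly into arithmetic. In the sketch below, the function names and all numeric inputs (8 elements per cycle, 250 MHz, 32-bit elements, 2 W) are hypothetical values chosen for illustration; they are not results from the slides.

```python
def throughput_gbits_per_s(elements_per_second, data_width_bits):
    """Throughput = elements permuted per second * bits per element, in Gbits/s."""
    return elements_per_second * data_width_bits / 1e9


def energy_efficiency_gbits_per_joule(throughput_gbits, average_power_watts):
    """Energy efficiency = throughput / average power, in Gbits/Joule."""
    return throughput_gbits / average_power_watts


# Hypothetical design point, not taken from the presentation.
elements_per_second = 8 * 250e6          # 8 elements per cycle at 250 MHz
throughput = throughput_gbits_per_s(elements_per_second, data_width_bits=32)
efficiency = energy_efficiency_gbits_per_joule(throughput, average_power_watts=2.0)
```

Because Gbits/s divided by watts (joules per second) cancels the time unit, the result is in Gbits/Joule as the definition requires.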
27
Performance metrics
- Platform and tools
- Xilinx Virtex-7 XC7VX980T, speed grade -2L
- Xilinx Vivado 2014.2 and Vivado Power Analyzer
- Input vectors for simulation
- Randomly generated with an average toggle rate of 25% (pessimistic estimation)
- Performance metrics
- Resource consumption
- Throughput
- Energy efficiency
28
Experimental Setup
- BRAM consumption of the proposed design
- Theoretical memory requirement: reduced by 50% and 75%
- Amount of BRAM18: reduced by up to 50% compared with the state of the art across the tested configurations
29
Experimental Results (1)
- BRAM consumption of the proposed design
- Theoretical memory requirement: reduced by 50% and 75%
- Amount of BRAM18: reduced by up to 50% compared with the state of the art across the tested configurations
30
Experimental Results (2)
- LUT consumption of the proposed design (for various p)
- 22.1% to 67.2% fewer LUTs compared with [1]
- 59.1% to 96.4% fewer LUTs compared with [6]
31
Experimental Results (3)
- LUT consumption of the proposed design (for various N)
- 6.6% to 65.4% fewer LUTs compared with [1]
- 59.1% to 96.4% fewer LUTs compared with [6]
32
Experimental Results (4)
- Throughput performance of the proposed design
- Our designs achieve
- Up to 78% throughput improvement compared with [1]
- Up to 3.4x throughput compared with [6]
33
Experimental Results (5)
- Energy efficiency comparison
- 2.1x~3.3x energy efficiency improvement compared with the state-of-the-art in [6]
34
Experimental Results (6)
Conclusion and Future Work
35
- Conclusion
- Streaming data permutation architecture
- Scalable with data parallelism and problem size
- Efficient data permutation realization
- "Programmable" data permutation engine
- High throughput and resource efficient
- Future work
- Design framework for automatic application-specific energy efficiency and performance optimizations on FPGA
36