Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons

Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
School of Computer Science and Engineering, Nanyang Technological University


  1. Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons
     Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
     School of Computer Science and Engineering, Nanyang Technological University

  2. Matrix-Form 1-D DWT
     • Formulation: $C = TM \cdot X$, where $TM = \prod_i T_i$
     • The TM matrix is highly sparse
       - Large number of multiply-by-zero operations
       - Large memory footprint consisting of zeroes
     • Goals:
       - SIMD-friendly operations on non-zero values only
       - Customized DMA routines for efficient bandwidth utilization
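
To make the matrix-form formulation above concrete, here is a minimal sketch that builds a two-level transform matrix as a product of per-level matrices and counts how many of its entries are zero. The slides do not name the wavelet, so the Haar filter, the helper names (`level_matrix`, `matmul`, `matvec`, `zero_fraction`), and the 8-sample input are all assumptions chosen for illustration.

```python
import math

def level_matrix(n, active):
    """Dense n x n matrix applying one Haar analysis step to the first
    `active` entries and passing the remaining entries through unchanged."""
    s = 1.0 / math.sqrt(2.0)
    half = active // 2
    t = [[0.0] * n for _ in range(n)]
    for r in range(half):
        t[r][2 * r] = s               # approximation (low-pass) row
        t[r][2 * r + 1] = s
        t[half + r][2 * r] = s        # detail (high-pass) row
        t[half + r][2 * r + 1] = -s
    for r in range(active, n):
        t[r][r] = 1.0                 # identity on details from earlier levels
    return t

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, x):
    """Plain dense matrix-vector product."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in m]

def zero_fraction(m):
    """Fraction of entries in m that are exactly zero."""
    zeros = sum(1 for row in m for v in row if v == 0.0)
    return zeros / (len(m) * len(m[0]))

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
n = len(x)
t1 = level_matrix(n, n)        # level 1 acts on all n samples
t2 = level_matrix(n, n // 2)   # level 2 acts on the n/2 approximations
tm = matmul(t2, t1)            # TM as a product of per-level matrices
c = matvec(tm, x)              # C = TM . X

print([round(v, 4) for v in c])
print("fraction of zero entries in TM:", zero_fraction(tm))
```

Even at this toy size most of TM is zero, and the dense product still multiplies by every one of those zeros, which is the waste the slide points at.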

  5. Sparse Matrix Skeleton
     • Remove multiply-by-zero operations
     • Reduction in memory footprint of TM
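
The slide does not describe how a skeleton is laid out, so the following is only a generic sketch of the idea it states: store the non-zero entries of each row and multiply by those alone. The per-row `(column, value)` representation, the Haar filter, and every function name here are assumptions, not the authors' format.

```python
import math

def haar_matrix(n):
    """Dense n x n one-level Haar analysis matrix (assumed example filter)."""
    s = 1.0 / math.sqrt(2.0)
    half = n // 2
    t = [[0.0] * n for _ in range(n)]
    for r in range(half):
        t[r][2 * r], t[r][2 * r + 1] = s, s
        t[half + r][2 * r], t[half + r][2 * r + 1] = s, -s
    return t

def to_skeleton(dense):
    """Keep only the non-zero entries of each row as (column, value) pairs."""
    return [[(j, v) for j, v in enumerate(row) if v != 0.0] for row in dense]

def dense_matvec(dense, x):
    """Return (result, number of multiplications) for the dense product."""
    out = [sum(v * xj for v, xj in zip(row, x)) for row in dense]
    return out, len(dense) * len(x)

def skeleton_matvec(skel, x):
    """Return (result, number of multiplications) using non-zeros only."""
    out, mults = [], 0
    for row in skel:
        acc = 0.0
        for j, v in row:
            acc += v * x[j]
            mults += 1
        out.append(acc)
    return out, mults

n = 16
x = [float(i % 5) for i in range(n)]
dense = haar_matrix(n)
skel = to_skeleton(dense)
y_dense, dense_mults = dense_matvec(dense, x)
y_skel, skel_mults = skeleton_matvec(skel, x)
stored = sum(len(row) for row in skel)

print("multiplications: dense", dense_mults, "skeleton", skel_mults)
print("stored values:   dense", n * n, "skeleton", stored)
```

Both products give the same coefficients. The skeleton version performs one multiplication per stored non-zero, so both the work and the storage scale with the number of non-zeros rather than with the full matrix size.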

  6. Modified Matrix-Form 1-D DWT N = 65536

  7. VectorBlox MXP
     • Lanes: 16-32
     • Scratchpad: 64-128 KB
     • DMA bandwidth: 4-32 B/cycle
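
A back-of-the-envelope calculation helps relate these figures to the N = 65536 input from the previous slide. The sketch below assumes 4-byte samples and treats 1 KB as 1024 bytes; neither is stated in the slides. It also counts only the input vector, so the results are lower bounds, since a real run must hold coefficients and outputs in the scratchpad too.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

N = 65536              # number of input samples (from the slides)
BYTES_PER_SAMPLE = 4   # assumption: 32-bit samples
input_bytes = N * BYTES_PER_SAMPLE

# Minimum number of scratchpad-sized tiles needed to stream the input.
tiles = {kib: ceil_div(input_bytes, kib * 1024) for kib in (64, 128)}

# Minimum DMA cycles to move the input once at each quoted bandwidth.
dma_cycles = {bw: ceil_div(input_bytes, bw) for bw in (4, 32)}

print("input size (bytes):", input_bytes)
for kib, count in tiles.items():
    print(f"{kib} KB scratchpad: at least {count} tiles")
for bw, cycles in dma_cycles.items():
    print(f"{bw} B/cycle DMA: at least {cycles} cycles")
```

Under these assumptions the input alone does not fit in the scratchpad at either size, so the data must be processed in tiles, which is why DMA routine design matters for bandwidth utilization.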

  8. Results - Speedup
     [Figure: speedup (axis 0 to 60) of MXP-DE2, MXP-DE4, and MXP-Zedboard over the baseline CPUs (Raspberry Pi, Zedboard, BeagleBone Black), for $N = 2^{16}$, $L = 6$, and $k = 3$.]

  11. Summary
      • We propose a Modified Matrix-Form scheme to unlock the inherent parallelism in 1-D DWT
      • We exploit the sparsity pattern in TM to reduce complexity from $O(n^2)$ to $O(n)$ using:
        - Skeletons to avoid wasteful multiply-by-zero operations
        - Rearrangement of input samples
      • Speedups of 12-103x over the state-of-the-art built-in signal library in Octave (dwt function)
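
As a rough illustration of the complexity reduction claimed in the summary, compare one dense matrix-vector product of size $N$ with a product that touches only $f$ non-zero entries per row. The symbol $f$ is introduced here for illustration and does not appear in the slides. For the $N = 65536$ used in the deck:

$$
\underbrace{N^2}_{\text{dense}} = (2^{16})^2 = 2^{32} \approx 4.29 \times 10^{9}
\qquad \text{versus} \qquad
\underbrace{fN}_{\text{non-zeros only}} = f \cdot 2^{16} = 65536\,f
$$

With $f$ fixed by the filter rather than by $N$, the multiplication count grows linearly in $N$, which is the sense in which the cost drops from quadratic to linear.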

  12. Experimental Setup
      • CPU (Matrix-form 1-D DWT)
        - Optimized OpenBLAS routines in Octave and C (compiled with -O3)
        - Performance measured using PAPI v5.4.3
        - 32-bit ARMv7 on BeagleBone Black and Zedboard, ARMv6 on Raspberry Pi
      • CPU + MXP (Sparse Matrix Skeletons)
        - Customized DMA routines for data transfer between host and MXP
        - 16-32 vector lanes
        - 64-128 KB scratchpad memory
        - Performance measured using the MXP Timing API
        - Altera DE2/DE4 and Zedboard

  13. Results - Throughput
      [Figure: energy (mJ) versus throughput (GOps/s) for ARM (BeagleBone Black), ARM (Zedboard), ARM (Raspberry Pi), MXP-DE2, MXP-DE4, and MXP-Zedboard, for $N = 2^{16}$, $L = 6$, and $k = 3$.]

  14. Challenges
      • Large volume of data
      • Strict real-time processing constraints
      • High accuracy demands
      • Energy constraints, especially in embedded systems

  15. Modified Matrix-Form 1-D DWT Rearrangement
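
The slide only names the rearrangement step, so the sketch below shows one plausible reading rather than the authors' scheme: de-interleaving the input into even and odd streams so that each filter tap becomes an element-wise operation over whole vectors, which suits SIMD lanes. The Haar filter and all function names are assumptions.

```python
import math

def deinterleave(x):
    """Rearrange samples into even-indexed and odd-indexed streams."""
    return x[0::2], x[1::2]

def haar_rearranged(x):
    """One Haar level as element-wise operations on the two streams."""
    s = 1.0 / math.sqrt(2.0)
    even, odd = deinterleave(x)
    approx = [s * (e + o) for e, o in zip(even, odd)]   # vector add, scale
    detail = [s * (e - o) for e, o in zip(even, odd)]   # vector subtract, scale
    return approx, detail

def haar_reference(x):
    """Same level computed pair by pair on the original sample order."""
    s = 1.0 / math.sqrt(2.0)
    approx, detail = [], []
    for i in range(0, len(x), 2):
        approx.append(s * (x[i] + x[i + 1]))
        detail.append(s * (x[i] - x[i + 1]))
    return approx, detail

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_rearranged(x)
ref_approx, ref_detail = haar_reference(x)

print([round(v, 4) for v in approx])
print([round(v, 4) for v in detail])
```

After rearrangement, every multiplication involves a non-zero coefficient and the operands sit contiguously in memory, so no zero entries of the transform matrix are ever loaded or multiplied.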
